Re: [ceph-users] Finding out how much data is in the journal

2015-03-23 Thread Josef Johansson

> On 23 Mar 2015, at 03:58, Haomai Wang  wrote:
> 
> On Mon, Mar 23, 2015 at 2:53 AM, Josef Johansson wrote:
>> Hi all!
>> 
>> Trying to figure out how much my journals are used, using SSDs as journals 
>> and SATA-drives as storage, I dive into perf dump.
>> But I can’t figure out why journal_queue_bytes is at constant 0. The only 
>> thing that differs is dirtied in WBThrottle.
> 
> journal_queue_bytes means how much journal data is sitting in the queue
> waiting for the journal thread to process it.
> 
> As of now, the OSD can't tell you how much data in the journal is still
> waiting for writeback and sync.
> 
Hm, who knows that then?
Is this the WBThrottle value?

No way of knowing how much journal is used at all?

Maybe I thought of this wrong so if I understand you correctly

Data is written to OSD
The journal saves it to the queue
Waits for others to sync the requests as well
Sends an ACK to the client
Starts writing to the filestore buffer
filestore buffer commits when limits are met (inodes/ios-dirtied, 
filestore_sync_max_interval)

So if I’m seeing latency and want to know whether my journals are lagging, I should 
indeed look at journal_queue_bytes; if that’s zero, they’re behaving well.

Thanks,
Josef

>> 
>> Maybe I’ve disable that when setting the in-memory debug variables to 0/0?
>> 
>> Thanks,
>> Josef
>> 
>> # ceph --version
>> ceph version 0.67.7 (d7ab4244396b57aac8b7e80812115bbd079e6b73)
>> 
>> # ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep 
>> journal
>>  "journaler": "0\/0",
>>  "journal": "0\/0",
>>  "journaler_allow_split_entries": "true",
>>  "journaler_write_head_interval": "15",
>>  "journaler_prefetch_periods": "10",
>>  "journaler_prezero_periods": "5",
>>  "journaler_batch_interval": "0.001",
>>  "journaler_batch_max": "0",
>>  "mds_kill_journal_at": "0",
>>  "mds_kill_journal_expire_at": "0",
>>  "mds_kill_journal_replay_at": "0",
>>  "osd_journal": "\/var\/lib\/ceph\/osd\/ceph-0\/journal",
>>  "osd_journal_size": "25600",
>>  "filestore_fsync_flushes_journal_data": "false",
>>  "filestore_journal_parallel": "false",
>>  "filestore_journal_writeahead": "false",
>>  "filestore_journal_trailing": "false",
>>  "journal_dio": "true",
>>  "journal_aio": "true",
>>  "journal_force_aio": "false",
>>  "journal_max_corrupt_search": "10485760",
>>  "journal_block_align": "true",
>>  "journal_write_header_frequency": "0",
>>  "journal_max_write_bytes": "10485760",
>>  "journal_max_write_entries": "100",
>>  "journal_queue_max_ops": "300",
>>  "journal_queue_max_bytes": "33554432",
>>  "journal_align_min_size": "65536",
>>  "journal_replay_from": "0",
>>  "journal_zero_on_create": "false",
>>  "journal_ignore_corruption": "false",
>> 
>> # ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump
>> { "WBThrottle": { "bytes_dirtied": 32137216,
>>  "bytes_wb": 0,
>>  "ios_dirtied": 1445,
>>  "ios_wb": 0,
>>  "inodes_dirtied": 491,
>>  "inodes_wb": 0},
>>  "filestore": { "journal_queue_max_ops": 300,
>>  "journal_queue_ops": 0,
>>  "journal_ops": 116105073,
>>  "journal_queue_max_bytes": 33554432,
>>  "journal_queue_bytes": 0,
>>  "journal_bytes": 3160504432839,
>>  "journal_latency": { "avgcount": 116105073,
>>  "sum": 64951.260611000},
>>  "journal_wr": 112261141,
>>  "journal_wr_bytes": { "avgcount": 112261141,
>>  "sum": 3426141528064},
>>  "op_queue_max_ops": 50,
>>  "op_queue_ops": 0,
>>  "ops": 116105073,
>>  "op_queue_max_bytes": 104857600,
>>  "op_queue_bytes": 0,
>>  "bytes": 3159111228243,
>>  "apply_latency": { "avgcount": 116105073,
>>  "sum": 247410.066048000},
>>  "committing": 0,
>>  "commitcycle": 267176,
>>  "commitcycle_interval": { "avgcount": 267176,
>>  "sum": 1873193.631124000},
>>  "commitcycle_latency": { "avgcount": 267176,
>>  "sum": 390421.06299},
>>  "journal_full": 0,
>>  "queue_transaction_latency_avg": { "avgcount": 116105073,
>>  "sum": 378.948923000}},
>>  "leveldb": { "leveldb_get": 699871216,
>>  "leveldb_transaction": 522440246,
>>  "leveldb_compact": 0,
>>  "leveldb_compact_range": 0,
>>  "leveldb_compact_queue_merge": 0,
>>  "leveldb_compact_queue_len": 0},
>>  "mutex-FileJournal::completions_lock": { "wait": { "avgcount": 0,
>>  "sum": 0.0}},
>>  "mutex-FileJournal::finisher_lock": { "wait": { "avgcount": 0,
>>  "sum": 0.0}},
>>  "mutex-FileJournal::write_lock": { "wait": { "avgcount": 0,
>>  "sum": 0.0}},
>>  "mutex-FileJournal::writeq_lock": { "wait": { "avgcount": 0,
>>  "sum": 0.0}},
>>  "mutex-JOS::ApplyManager::apply_lock": { "wait": { "avgcount": 0,
>>  "sum": 0.0}},
>>  "mutex-JOS::ApplyManager::com_lock": { "wait": { "avgcount": 0,
>>  "sum": 0.0}},
>>  "mutex-JOS::SubmitManager::lock": { "wait": { "avgcount": 0,
>>  "sum": 0.0}},
>> 

Re: [ceph-users] SSD Hardware recommendation

2015-03-23 Thread Christian Balzer

Hello,

Again refer to my original, old mail:

http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043949.html

I was strictly looking at the SMART values, in the case of these 
Intel DC S3700 SSDs the "Host_Writes_32MiB" values.
Which, according to what the name implies and all the references I could
find, means exactly that: the writes from the host (the SATA controller) to
the actual SSD.
So no matter what optimizations the SSD does itself and what other things
might be possible with things like O_DSYNC, the combination of all the
things mentioned before in the Ceph/FS stack caused a 12x amplification
(instead of 2x) _before_ hitting the SSD.
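
(For reference, reading that attribute is as simple as the following, assuming the
SSD is /dev/sda and a smartmontools build whose drive database knows the Intel
attribute names:

  smartctl -A /dev/sda | grep -i host_writes

The raw value is in units of 32MiB, so raw_value * 32 / 1024 gives the GiB actually
written by the host.)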

And that's where optimizations in Ceph and other components, maybe
avoiding a FS altogether, will be very helpful and welcome.

Regards,

Christian

On Mon, 23 Mar 2015 07:49:41 +0100 (CET) Alexandre DERUMIER wrote:

> >>(not tested, but I think with journal and O_DSYNC writes, it can give
> >>use ssd write amplification)
> 
> also, I think that enterprise SSDs with a supercapacitor should be able to
> cache these O_DSYNC writes in the SSD buffer and do bigger writes to
> reduce amplification.
> 
> I don't know how the SSD internal algorithms handle this.
> 
> 
> - Original Message -
> From: "aderumier" 
> To: "Christian Balzer" 
> Cc: "ceph-users" 
> Sent: Monday, 23 March 2015 07:36:48
> Subject: Re: [ceph-users] SSD Hardware recommendation
> 
> Hi, 
> 
> Isn't it in the nature of SSDs to have write amplification? 
> 
> Generally, they have an erase block size of 128k, 
> 
> so the worst case could be 128/4 = 32x write amplification. 
> 
> (of course SSD algorithms and optimisations reduce this write
> amplification). 
> 
> Now, it would be great to see whether it's coming from the osd journal or the osd
> data. 
> 
> (not tested, but I think the journal and its O_DSYNC writes can give
> us SSD write amplification) 
> 
> 
> - Original Message - 
> From: "Christian Balzer" 
> To: "ceph-users" 
> Sent: Monday, 23 March 2015 03:11:39 
> Subject: Re: [ceph-users] SSD Hardware recommendation 
> 
> On Mon, 23 Mar 2015 02:33:20 +0100 Francois Lafont wrote: 
> 
> > Hi, 
> > 
> > Sorry Christian for my late answer. I was a little busy. 
> > 
> > Christian Balzer wrote: 
> > 
> > > You're asking the wrong person, as I'm neither a Ceph or kernel 
> > > developer. ^o^ 
> > 
> > No, no, the rest of the message proves to me that I talk to the 
> > right person. ;) 
> > 
> > > Back then Mark Nelson from the Ceph team didn't expect to see those 
> > > numbers as well, but both Mark Wu and I saw them. 
> > > 
> > > Anyways, lets start with the basics and things that are
> > > understandable without any detail knowledge. 
> > > 
> > > Assume a cluster with 2 nodes, 10 OSDs each and a replication of 2 
> > > (Since we're talking about SSD cluster here and keep things related
> > > to the question of the OP). 
> > > 
> > > Now a client writes 40MB of data to the cluster. 
> > > Assuming an ideal scenario where all PGs are evenly distributed
> > > (they won't be) and this is totally fresh data (resulting in 10 4MB
> > > Ceph objects), this would mean that each OSD will receive 4MB (10
> > > primary PGs, 10 secondary ones). 
> > > With journals on the same SSD (currently the best way based on
> > > tests), we get a write amplification of 2, as that data is written
> > > both to the journal and the actual storage space. 
> > > 
> > > But as my results in the link above showed, that is very much 
> > > dependent on the write size. With a 4MB block size (the ideal size
> > > for default RBD pools and objects) I saw even slightly less than the
> > > 2x amplifications expected, I assume that was due to caching and PG 
> > > imbalances. 
> > > 
> > > Now my guess what happens with small (4KB) writes is that all these 
> > > small writes do not coalesce sufficiently before being written to
> > > the object on the OSD. 
> > > So up to 1000 4KB writes could happen to that 4MB object (clearly it
> > > is much less than that, but how much I can't tell), resulting in the
> > > same "blocks" being rewritten several times. 
> > 
> > Ok, if I understand well, with replication == 2 and journals on the same 
> > disks of the OSDs (I assume that we are talking about storage via
> > block device): 
> > 
> > 1. in theory there is a "write" amplification (between the client side 
> > and the OSDs backend side) equal to 2 x #replication = 4, because data 
> > is written in the journal and after in the OSD storage. 
> > 
> 
> The write amplification on a cluster wide basis is like that, but we're 
> only interested in the amplification on the OSD level (SSD wearout) and 
> there it should be 2 (journal and data). 
> 
> Also keep in mind that both the OSD distribution isn't perfect and 
> that you might have some very hot data (frequently written) that resides 
> only in one Ceph object (4MB), so just on one PG and thus hitting only 3 
> OSDs (replication of 3) all the time, while other OSDs see much les

[ceph-users] Issue with free Inodes

2015-03-23 Thread Kamil Kuramshin
Recently got a problem with OSDs based on SSD disks used in cache tier 
for EC-pool


superuser@node02:~$ df -i
Filesystem       Inodes    IUsed  IFree IUse% Mounted on
<...>
/dev/sdb1       3335808  3335808      0  100% /var/lib/ceph/osd/ceph-45
/dev/sda1       3335808  3335808      0  100% /var/lib/ceph/osd/ceph-46


Now those OSDs are down on each ceph node and cache tiering is not working.

superuser@node01:~$ sudo tail /var/log/ceph/ceph-osd.45.log
2015-03-23 10:04:23.631137 7fb105345840  0 ceph version 0.87.1 
(283c2e7cfa2457799f534744d7d549f83ea1335e), process ceph-osd, pid 1453465
2015-03-23 10:04:23.640676 7fb105345840  0 
filestore(/var/lib/ceph/osd/ceph-45) backend generic (magic 0xef53)
2015-03-23 10:04:23.640735 7fb105345840 -1 
genericfilestorebackend(/var/lib/ceph/osd/ceph-45) detect_features: 
unable to create /var/lib/ceph/osd/ceph-45/fiemap_test: (28) No space 
left on device
2015-03-23 10:04:23.640763 7fb105345840 -1 
filestore(/var/lib/ceph/osd/ceph-45) _detect_fs: detect_features error: 
(28) No space left on device
2015-03-23 10:04:23.640772 7fb105345840 -1 
filestore(/var/lib/ceph/osd/ceph-45) FileStore::mount : error in 
_detect_fs: (28) No space left on device
2015-03-23 10:04:23.640783 7fb105345840 -1  ** ERROR: error converting 
store /var/lib/ceph/osd/ceph-45: (28) *No space left on device*


At the same time, df -h is confusing:

superuser@node01:~$ df -h
Filesystem  Size  Used Avail Use% Mounted on
<...>
/dev/sda1    50G   29G   20G  60% /var/lib/ceph/osd/ceph-45
/dev/sdb1    50G   27G   21G  56% /var/lib/ceph/osd/ceph-46


The filesystem used on the affected OSDs is ext4. All OSDs were deployed with 
ceph-deploy:

$ ceph-deploy osd create --zap-disk --fs-type ext4 :


Luckily it was just a test deployment; all EC-pool data was 
lost, since I couldn't start the OSDs and the ceph cluster stayed degraded until 
I removed all the affected tiered pools (cache & EC).
So this is just my observation of what kind of problems can be faced if 
you choose the wrong filesystem for the OSD backend.
And now I *strongly* recommend choosing *XFS* or *Btrfs* filesystems, 
because both support dynamic inode allocation and this problem 
can't arise with them.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Finding out how much data is in the journal

2015-03-23 Thread Haomai Wang
On Mon, Mar 23, 2015 at 3:04 PM, Josef Johansson  wrote:
>
> On 23 Mar 2015, at 03:58, Haomai Wang  wrote:
>
> On Mon, Mar 23, 2015 at 2:53 AM, Josef Johansson  wrote:
>
> Hi all!
>
> Trying to figure out how much my journals are used, using SSDs as journals
> and SATA-drives as storage, I dive into perf dump.
> But I can’t figure out why journal_queue_bytes is at constant 0. The only
> thing that differs is dirtied in WBThrottle.
>
>
> journal_queue_bytes means how much journal data is sitting in the queue
> waiting for the journal thread to process it.
>
> As of now, the OSD can't tell you how much data in the journal is still
> waiting for writeback and sync.
>
> Hm, who knows that then?
> Is this the WBThrottle value?

WBThrottle will only tell you the dirtied data in the system buffer
cache. It's not directly related to journal data.

>
> No way of knowing how much journal is used at all?
>
> Maybe I thought of this wrong so if I understand you correctly
>
> Data is written to OSD
> The journal saves it to the queue
> Waits for others to sync the requests as well
> Sends an ACK to the client
> Starts writing to the filestore buffer
> filestore buffer commits when limits are met (inodes/ios-dirtied,
> filestore_sync_max_interval)
>
> So if I’m seeing latency and want to know whether my journals are lagging, I should
> indeed look at journal_queue_bytes; if that’s zero, they’re behaving well.

 Yes and no: journal_queue_bytes is related to latency, but it is not always
the key to seeing whether the journal is lazy.
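
If you still want to watch it, you can poll the admin socket, a rough sketch
(assuming osd.0 and GNU grep):

  while sleep 1; do
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump \
      | grep -oE '"journal_queue_(bytes|ops)": [0-9]+'
  done

journal_latency and apply_latency in the same dump are usually more useful for
latency questions.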


>
> Thanks,
> Josef
>
>
> Maybe I’ve disable that when setting the in-memory debug variables to 0/0?
>
> Thanks,
> Josef
>
> # ceph --version
> ceph version 0.67.7 (d7ab4244396b57aac8b7e80812115bbd079e6b73)
>
> # ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep
> journal
>  "journaler": "0\/0",
>  "journal": "0\/0",
>  "journaler_allow_split_entries": "true",
>  "journaler_write_head_interval": "15",
>  "journaler_prefetch_periods": "10",
>  "journaler_prezero_periods": "5",
>  "journaler_batch_interval": "0.001",
>  "journaler_batch_max": "0",
>  "mds_kill_journal_at": "0",
>  "mds_kill_journal_expire_at": "0",
>  "mds_kill_journal_replay_at": "0",
>  "osd_journal": "\/var\/lib\/ceph\/osd\/ceph-0\/journal",
>  "osd_journal_size": "25600",
>  "filestore_fsync_flushes_journal_data": "false",
>  "filestore_journal_parallel": "false",
>  "filestore_journal_writeahead": "false",
>  "filestore_journal_trailing": "false",
>  "journal_dio": "true",
>  "journal_aio": "true",
>  "journal_force_aio": "false",
>  "journal_max_corrupt_search": "10485760",
>  "journal_block_align": "true",
>  "journal_write_header_frequency": "0",
>  "journal_max_write_bytes": "10485760",
>  "journal_max_write_entries": "100",
>  "journal_queue_max_ops": "300",
>  "journal_queue_max_bytes": "33554432",
>  "journal_align_min_size": "65536",
>  "journal_replay_from": "0",
>  "journal_zero_on_create": "false",
>  "journal_ignore_corruption": "false",
>
> # ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump
> { "WBThrottle": { "bytes_dirtied": 32137216,
>  "bytes_wb": 0,
>  "ios_dirtied": 1445,
>  "ios_wb": 0,
>  "inodes_dirtied": 491,
>  "inodes_wb": 0},
>  "filestore": { "journal_queue_max_ops": 300,
>  "journal_queue_ops": 0,
>  "journal_ops": 116105073,
>  "journal_queue_max_bytes": 33554432,
>  "journal_queue_bytes": 0,
>  "journal_bytes": 3160504432839,
>  "journal_latency": { "avgcount": 116105073,
>  "sum": 64951.260611000},
>  "journal_wr": 112261141,
>  "journal_wr_bytes": { "avgcount": 112261141,
>  "sum": 3426141528064},
>  "op_queue_max_ops": 50,
>  "op_queue_ops": 0,
>  "ops": 116105073,
>  "op_queue_max_bytes": 104857600,
>  "op_queue_bytes": 0,
>  "bytes": 3159111228243,
>  "apply_latency": { "avgcount": 116105073,
>  "sum": 247410.066048000},
>  "committing": 0,
>  "commitcycle": 267176,
>  "commitcycle_interval": { "avgcount": 267176,
>  "sum": 1873193.631124000},
>  "commitcycle_latency": { "avgcount": 267176,
>  "sum": 390421.06299},
>  "journal_full": 0,
>  "queue_transaction_latency_avg": { "avgcount": 116105073,
>  "sum": 378.948923000}},
>  "leveldb": { "leveldb_get": 699871216,
>  "leveldb_transaction": 522440246,
>  "leveldb_compact": 0,
>  "leveldb_compact_range": 0,
>  "leveldb_compact_queue_merge": 0,
>  "leveldb_compact_queue_len": 0},
>  "mutex-FileJournal::completions_lock": { "wait": { "avgcount": 0,
>  "sum": 0.0}},
>  "mutex-FileJournal::finisher_lock": { "wait": { "avgcount": 0,
>  "sum": 0.0}},
>  "mutex-FileJournal::write_lock": { "wait": { "avgcount": 0,
>  "sum": 0.0}},
>  "mutex-FileJournal::writeq_lock": { "wait": { "avgcount": 0,
>  "sum": 0.0}},
>  "mutex-JOS::ApplyManager::apply_lock": { "wait": { "avgcount": 0,
>  "sum": 0.0}},
>  "mutex-JO

[ceph-users] add stop_scrub command for ceph

2015-03-23 Thread Xinze Chi
hi ceph:

Currently, there is no command which can stop scrubbing while
a pg is doing a scrub or deep
scrub. What about adding a command to support it? I think this is very
useful for system administrators.

 I have added an issue to track it: http://tracker.ceph.com/issues/11202.
-- 
Regards,
xinze
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Issue with free Inodes

2015-03-23 Thread Christian Balzer

Hello,

This is rather confusing, as cache-tiers are just normal OSDs/pools and
thus should have Ceph objects of around 4MB in size by default.

This matches what I see with Ext4 here (normal OSD, not a cache
tier):
---
size:
/dev/sde1   2.7T  204G  2.4T   8% /var/lib/ceph/osd/ceph-0
inodes:
/dev/sde1  183148544  55654  183092890    1% /var/lib/ceph/osd/ceph-0
---

On a more fragmented cluster I see a 5:1 size to inode ratio.

I just can't fathom how there could be 3.3 million inodes (and thus a
close number of files) using 30G, making the average file size below 10
KBytes. 

Something other than your choice of file system is probably at play here.

How fragmented are those SSDs?
What's your default Ceph object size?
Where _are_ those 3 million files in that OSD, are they actually in the
object files like:
-rw-r--r-- 1 root root 4194304 Jan  9 15:27 
/var/lib/ceph/osd/ceph-0/current/3.117_head/DIR_7/DIR_1/DIR_5/rb.0.23a8f.238e1f29.00027632__head_C4F3D517__3
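
(A quick way to check, assuming the OSD filesystem is still mounted:

  find /var/lib/ceph/osd/ceph-45/current -type f | wc -l
  du -sh /var/lib/ceph/osd/ceph-45/current/* | sort -h | tail

would show how many files there are and which directories hold the bulk of them.)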

What's your use case, RBD, CephFS, RadosGW?

Regards,

Christian

On Mon, 23 Mar 2015 10:32:55 +0300 Kamil Kuramshin wrote:

> Recently got a problem with OSDs based on SSD disks used in cache tier 
> for EC-pool
> 
> superuser@node02:~$ df -i
> FilesystemInodes   IUsed *IFree* IUse% Mounted on
> <...>
> /dev/sdb13335808 3335808 *0* 100% 
> /var/lib/ceph/osd/ceph-45
> /dev/sda13335808 3335808 *0* 100% 
> /var/lib/ceph/osd/ceph-46
> 
> Now that OSDs are down on each ceph-node and cache tiering is not
> working.
> 
> superuser@node01:~$ sudo tail /var/log/ceph/ceph-osd.45.log
> 2015-03-23 10:04:23.631137 7fb105345840  0 ceph version 0.87.1 
> (283c2e7cfa2457799f534744d7d549f83ea1335e), process ceph-osd, pid 1453465
> 2015-03-23 10:04:23.640676 7fb105345840  0 
> filestore(/var/lib/ceph/osd/ceph-45) backend generic (magic 0xef53)
> 2015-03-23 10:04:23.640735 7fb105345840 -1 
> genericfilestorebackend(/var/lib/ceph/osd/ceph-45) detect_features: 
> unable to create /var/lib/ceph/osd/ceph-45/fiemap_test: (28) No space 
> left on device
> 2015-03-23 10:04:23.640763 7fb105345840 -1 
> filestore(/var/lib/ceph/osd/ceph-45) _detect_fs: detect_features error: 
> (28) No space left on device
> 2015-03-23 10:04:23.640772 7fb105345840 -1 
> filestore(/var/lib/ceph/osd/ceph-45) FileStore::mount : error in 
> _detect_fs: (28) No space left on device
> 2015-03-23 10:04:23.640783 7fb105345840 -1  ** ERROR: error converting 
> store /var/lib/ceph/osd/ceph-45: (28) *No space left on device*
> 
> In the same time*df -h *is confusing:
> 
> superuser@node01:~$ df -h
> Filesystem  Size  Used *Avail* Use% Mounted on
> <...>
> /dev/sda150G   29G *20G*
> 60% /var/lib/ceph/osd/ceph-45 /dev/sdb150G   27G
> *21G*  56% /var/lib/ceph/osd/ceph-46
> 
> 
> Filesystem used on affected OSDs is EXt4. All OSDs are deployed with 
> ceph-deploy:
> $ ceph-deploy osd create --zap-disk --fs-type ext4 :
> 
> 
> Help me out what it was just test deployment and all EC-pool data was 
> lost since I /can't start OSDs/ and ceph cluster/becames degraded /until 
> I removed all affected tiered pools (cache & EC)
> So this is just my observation of what kind of problems can be faced if 
> you choose wrong Filesystem for OSD backend.
> And now I *strongly* recommend you to choose*XFS* or *Btrfs* filesystems 
> because both are supporting dynamic inode allocation and this problem 
> can't arise with them.
> 
> 


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD Forece Removal

2015-03-23 Thread Stéphane DUGRAVOT


- Original Message -



Thanks Stephane, the thing is that those steps need to be run on the node where 
the osd lives. I don't have that node any more since the operating system got 
corrupted, so I couldn't make it work :( 




Jesus, 
On my test cluster, what I have done is (experimental): 


* Edit the crushmap and attach the failed osd (osd.2) to a running host (it was 
previously attached to the delta host) 
* reinject the crushmap 
* then run the 2 commands: ceph osd down osd.2; ceph osd rm 2 
* After that, ceph osd tree gives me: 

  # id   weight  type name         up/down  reweight
  -100   2       root default
  -1     1           host bravo
  0      1               osd.0     up       1
  -2     1           host charlie
  1      1               osd.1     up       1
  2      1               osd.2     DNE

* edit the crushmap again, remove the reference to osd.2 (DNE) 
* Reinject, and TADA!: 

  # id   weight  type name         up/down  reweight
  -100   2       root default
  -1     1           host bravo
  0      1               osd.0     up       1
  -2     1           host charlie
  1      1               osd.1     up       1
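
(For reference, the edit/reinject part was roughly this, with arbitrary paths:

  ceph osd getcrushmap -o /tmp/cm.bin
  crushtool -d /tmp/cm.bin -o /tmp/cm.txt
  # edit /tmp/cm.txt, then recompile and inject it:
  crushtool -c /tmp/cm.txt -o /tmp/cm.new
  ceph osd setcrushmap -i /tmp/cm.new

The documented manual removal boils down to these commands, which can be run from
any admin node:

  ceph osd crush remove osd.2
  ceph auth del osd.2
  ceph osd rm 2
)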

Stephane. 




Thanks 


Jesus Chavez 
SYSTEMS ENGINEER-C.SALES 

jesch...@cisco.com 
Phone: +52 55 5267 3146 
Mobile: +51 1 5538883255 

CCIE - 44433 

On Mar 20, 2015, at 3:49 AM, Stéphane DUGRAVOT < 
stephane.dugra...@univ-lorraine.fr > wrote: 






- Original Message -


Hi all, can anybody tell me how I can force-delete osds? The thing is that one 
node got corrupted because of an outage, so there is no way to get those osds up 
and back. Is there any way to force the removal from the ceph-deploy node? 



Hi, 
Try manual : 


* 
http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual
 




Thanks 
 



Jesus Chavez 
SYSTEMS ENGINEER-C.SALES 

jesch...@cisco.com 
Phone: +52 55 5267 3146 
Mobile: +51 1 5538883255 

CCIE - 44433







___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 










___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant 0.87 update on CentOs 7

2015-03-23 Thread HEWLETT, Paul (Paul)** CTR **
Hi Steffen

We have recently encountered the errors described below. Initially one must set 
check_obsoletes=1 in the yum priorities.conf file.
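
That is, in /etc/yum/pluginconf.d/priorities.conf (assuming the yum-plugin-priorities
package is installed):

  [main]
  enabled = 1
  check_obsoletes = 1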

However subsequent yum updates cause problems.

The solution we use is to disable the epel repo by default:

  yum-config-manager --disable epel

and explicitly install libunwind:

 yum -y --enablerepo=epel install libunwind

Then updates occur cleanly...

yum -y update

Additionally we specify eu.ceph.com in the ceph.repo file. This all works with 
RHEL7.

If one does not do this then the incorrect librbd1, librados2 rpms are 
installed and this triggers a dependency install of the (incorrect) firefly 
rpms.

To recover remove librbd1,librados2:

  yum remove librbd1 librados2

HTH

Regards
Paul Hewlett
Senior Systems Engineer
Velocix, Cambridge
Alcatel-Lucent
t: +44 1223 435893 m: +44 7985327353



From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Steffen W 
Sørensen [ste...@me.com]
Sent: 22 March 2015 22:22
To: Ceph Users
Subject: Re: [ceph-users] Giant 0.87 update on CentOs 7

:) Now disabling epel, which seems to be the confusing repo above, just renders me with 
timeouts from http://ceph.com… is Ceph.com down 
currently?
http://eu.ceph.com answers currently… probably the trans-atlantic line or my 
provider :/



[root@n1 ~]# yum -y --disablerepo epel --disablerepo ceph-source update
Loaded plugins: fastestmirror, priorities
http://ceph.com/rpm-giant/el7/x86_64/repodata/repomd.xml: [Errno 12] Timeout on 
http://ceph.com/rpm-giant/el7/x86_64/repodata/repomd.xml: (28, 'Connection 
timed out after 30403 milliseconds')
Trying other mirror.
http://ceph.com/rpm-giant/el7/x86_64/repodata/repomd.xml: [Errno 12] Timeout on 
http://ceph.com/rpm-giant/el7/x86_64/repodata/repomd.xml: (28, 'Connection 
timed out after 30042 milliseconds')
Trying other mirror.
…

/Steffen
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Issue with free Inodes

2015-03-23 Thread Kamil Kuramshin
In my case it was a cache pool for an EC pool serving RBD images, the 
object size is 4 MB, and the client was a kernel-rbd client.
Each SSD is a 60G disk, 2 disks per node, 6 nodes in total = 12 OSDs 
in total.



On 23.03.2015 12:00, Christian Balzer wrote:

Hello,

This is rather confusing, as cache-tiers are just normal OSDs/pools and
thus should have Ceph objects of around 4MB in size by default.

This is matched on what I see with Ext4 here (normal OSD, not a cache
tier):
---
size:
/dev/sde1   2.7T  204G  2.4T   8% /var/lib/ceph/osd/ceph-0
inodes:
/dev/sde1  183148544 55654 1830928901% /var/lib/ceph/osd/ceph-0
---

On a more fragmented cluster I see a 5:1 size to inode ratio.

I just can't fathom how there could be 3.3 million inodes (and thus a
close number of files) using 30G, making the average file size below 10
Bytes.

Something other than your choice of file system is probably at play here.

How fragmented are those SSDs?
What's your default Ceph object size?
Where _are_ those 3 million files in that OSD, are they actually in the
object files like:
-rw-r--r-- 1 root root 4194304 Jan  9 15:27 
/var/lib/ceph/osd/ceph-0/current/3.117_head/DIR_7/DIR_1/DIR_5/rb.0.23a8f.238e1f29.00027632__head_C4F3D517__3

What's your use case, RBD, CephFS, RadosGW?

Regards,

Christian

On Mon, 23 Mar 2015 10:32:55 +0300 Kamil Kuramshin wrote:


Recently got a problem with OSDs based on SSD disks used in cache tier
for EC-pool

superuser@node02:~$ df -i
FilesystemInodes   IUsed *IFree* IUse% Mounted on
<...>
/dev/sdb13335808 3335808 *0* 100%
/var/lib/ceph/osd/ceph-45
/dev/sda13335808 3335808 *0* 100%
/var/lib/ceph/osd/ceph-46

Now that OSDs are down on each ceph-node and cache tiering is not
working.

superuser@node01:~$ sudo tail /var/log/ceph/ceph-osd.45.log
2015-03-23 10:04:23.631137 7fb105345840  0 ceph version 0.87.1
(283c2e7cfa2457799f534744d7d549f83ea1335e), process ceph-osd, pid 1453465
2015-03-23 10:04:23.640676 7fb105345840  0
filestore(/var/lib/ceph/osd/ceph-45) backend generic (magic 0xef53)
2015-03-23 10:04:23.640735 7fb105345840 -1
genericfilestorebackend(/var/lib/ceph/osd/ceph-45) detect_features:
unable to create /var/lib/ceph/osd/ceph-45/fiemap_test: (28) No space
left on device
2015-03-23 10:04:23.640763 7fb105345840 -1
filestore(/var/lib/ceph/osd/ceph-45) _detect_fs: detect_features error:
(28) No space left on device
2015-03-23 10:04:23.640772 7fb105345840 -1
filestore(/var/lib/ceph/osd/ceph-45) FileStore::mount : error in
_detect_fs: (28) No space left on device
2015-03-23 10:04:23.640783 7fb105345840 -1  ** ERROR: error converting
store /var/lib/ceph/osd/ceph-45: (28) *No space left on device*

In the same time*df -h *is confusing:

superuser@node01:~$ df -h
Filesystem  Size  Used *Avail* Use% Mounted on
<...>
/dev/sda150G   29G *20G*
60% /var/lib/ceph/osd/ceph-45 /dev/sdb150G   27G
*21G*  56% /var/lib/ceph/osd/ceph-46


Filesystem used on affected OSDs is EXt4. All OSDs are deployed with
ceph-deploy:
$ ceph-deploy osd create --zap-disk --fs-type ext4 :


Help me out what it was just test deployment and all EC-pool data was
lost since I /can't start OSDs/ and ceph cluster/becames degraded /until
I removed all affected tiered pools (cache & EC)
So this is just my observation of what kind of problems can be faced if
you choose wrong Filesystem for OSD backend.
And now I *strongly* recommend you to choose*XFS* or *Btrfs* filesystems
because both are supporting dynamic inode allocation and this problem
can't arise with them.






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph in Production: best practice to monitor OSD up/down status

2015-03-23 Thread Xabier Elkano
On 22/03/15 at 10:55, Saverio Proto wrote:
> Hello,
>
> I started to work with CEPH few weeks ago, I might ask a very newbie
> question, but I could not find an answer in the docs or in the ml
> archive for this.
>
> Quick description of my setup:
> I have a ceph cluster with two servers. Each server has 3 SSD drives I
> use for journals only. To map SAS disks that keep their journal on the same
> SSD drive to different failure domains, I wrote my own crushmap.
> I have now a total of 36OSD. Ceph health returns HEALTH_OK.
> I run the cluster with a couple of pools with size=3 and min_size=3
>
>
> Production operations questions:
> I manually stopped some OSDs to simulate a failure.
>
> As far as I understood, an "OSD down" condition is not enough to make
> CEPH start making new copies of objects. I noticed that I must mark
> the OSD as "out" to make ceph produce new copies.
> As far as I understood min_size=3 puts the object in readonly if there
> are not at least 3 copies of the object available.
>
> Is this behavior correct or I made some mistake creating the cluster ?
> Should I expect ceph to produce automatically a new copy for objects
> when some OSDs are down ?
> Is there any option to automatically mark "out" OSDs that go "down"?
Hi,

you should set this parameter in your ceph config file in mon section:

mon_osd_down_out_interval = 900

to set the interval (in seconds) that ceph will wait before marking a down
OSD as out and starting to make new copies. By
default it is set to 600 seconds.
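
You should also be able to change it at runtime without restarting the mons,
something like (untested here):

  ceph tell mon.* injectargs '--mon-osd-down-out-interval 900'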

BR,
Xabier
>
> thanks
>
> Saverio
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] "store is getting too big" on monitors

2015-03-23 Thread Joao Eduardo Luis

On 02/17/2015 11:13 AM, Mohamed Pakkeer wrote:

Hi Joao,

We followed your instruction to create the store dump

ceph-kvstore-tool /var/lib/ceph/mon/ceph-FOO/store.db list > store.dump

for above store's location, let's call it $STORE:

for m in osdmap pgmap; do
   for k in first_committed last_committed; do
 ceph-kvstore-tool $STORE get $m $k >> store.dump
   done
done

ceph-kvstore-tool $STORE get pgmap_meta last_osdmap_epoch >> store.dump
ceph-kvstore-tool $STORE get pgmap_meta version >> store.dump


Please find the store dump on the following link.

http://jmp.sh/LUh6iWo



You have over 40k osdmaps in the store.  Ceph usually only keeps 500 (by 
default, iirc), unless the cluster is unhealthy -- in which case the 
monitor will keep all osdmaps as far back as the last clean epoch.


As you have 40k I am guessing your cluster has been unhealthy for a 
while.  Once you get the osds to a healthy state, the monitors should 
trim the maps from 40k+ to ~500 or so, and the store will shrink 
significantly.


Please note, when I say 'healthy cluster', in this case, I only mean 
healthy osds.  In short, getting rid of all the osd warning and errors 
in 'ceph health detail' that pertains to osds.


  -Joao




--
Thanks & Regards
K.Mohamed Pakkeer



On Mon, Feb 16, 2015 at 8:14 PM, Joao Eduardo Luis mailto:j...@redhat.com>> wrote:

On 02/16/2015 12:57 PM, Mohamed Pakkeer wrote:


   Hi ceph-experts,

We are getting "store is getting too big" on our test cluster.
Cluster is running with giant release and configured as EC pool
to test
cephFS.

    cluster c2a97a2f-fdc7-4eb5-82ef-70c52f2eceb1
     health HEALTH_WARN too few pgs per osd (0 < min 20); mon.master01
    store is getting too big! 15376 MB >= 15360 MB; mon.master02 store is
    getting too big! 15402 MB >= 15360 MB; mon.master03 store is getting too
    big! 15402 MB >= 15360 MB; clock skew detected on mon.master02,
    mon.master03
     monmap e3: 3 mons at
    {master01=10.1.2.231:6789/0,master02=10.1.2.232:6789/0,master03=10.1.2.233:6789/0},
    election epoch 38, quorum 0,1,2 master01,master02,master03
   osdmap e97396: 552 osds: 552 up, 552 in
pgmap v354736: 0 pgs, 0 pools, 0 bytes data, 0 objects
  8547 GB used, 1953 TB / 1962 TB avail

We tried monitor restart with mon compact on start = true as well as
manual compaction using 'ceph tell mon.FOO compact'. But it didn't
reduce the size of store.db. We already deleted the pools and mds to
start fresh cluster. Do we need to delete the mon and recreate
again or
do we have any solution to reduce the store size?


Could you get us a list of all the keys on the store using
'ceph-kvstore-tool' ?  Instructions on the email you quoted.

Cheers!

   -Joao


Regards,
K.Mohamed Pakkeer



On 12/10/2014 07:30 PM, Kevin Sumner wrote:

 The mons have grown another 30GB each overnight (except for
003?), which
 is quite worrying.  I ran a little bit of testing yesterday
after my
 post, but not a significant amount.

 I wouldn’t expect compact on start to help this situation
based on the
 name since we don’t (shouldn’t?) restart the mons
regularly, but there
 appears to be no documentation on it.  We’re pretty good on
disk space
 on the mons currently, but if that changes, I’ll probably
use this to
 see about bringing these numbers in line.

This is an issue that has been seen on larger clusters, and it
usually
takes a monitor restart, with 'mon compact on start = true' or
manual
compaction 'ceph tell mon.FOO compact' to bring the monitor back
to a
sane disk usage level.

However, I have not been able to reproduce this in order to
track the
source. I'm guessing I lack the scale of the cluster, or the
appropriate
workload (maybe both).

What kind of workload are you running the cluster through? You
mention
cephfs, but do you have any more info you can share that could
help us
reproducing this state?

Sage also fixed an issue that could potentially cause this
(depending on
what is causing it in the first place) [1,2,3]. This bug, #9987,
is due
to a given cached value not being updated, leading to the
monitor not
removing unnecessary data, potentially causing

Re: [ceph-users] Issue with free Inodes

2015-03-23 Thread Thomas Foster
You could fix this by changing the block size (and with it the default inode
ratio) when formatting the mount point with mkfs.  I had this same issue when
dealing with the filesystem using glusterfs, and the solution is either to use a
filesystem that allocates inodes dynamically or to change the block size /
bytes-per-inode ratio when you build the filesystem.  Unfortunately, the only
way to fix the problem that I have seen is to reformat.
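
For ext4 the relevant mkfs knobs are the bytes-per-inode ratio or an explicit
inode count, e.g. (the device name is only an example):

  mkfs.ext4 -i 4096 /dev/sdb1       # one inode per 4 KiB of space
  mkfs.ext4 -N 15000000 /dev/sdb1   # or request a fixed number of inodes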

On Mon, Mar 23, 2015 at 5:51 AM, Kamil Kuramshin 
wrote:

>  In my case there was cache pool for ec-pool serving RBD-images, and
> object size is 4Mb, and client was an *kernel-rbd *client
> each SSD disk is 60G disk, 2 disk per node,  6 nodes in total = 12 OSDs in
> total
>
>
> On 23.03.2015 12:00, Christian Balzer wrote:
>
> Hello,
>
> This is rather confusing, as cache-tiers are just normal OSDs/pools and
> thus should have Ceph objects of around 4MB in size by default.
>
> This is matched on what I see with Ext4 here (normal OSD, not a cache
> tier):
> ---
> size:
> /dev/sde1   2.7T  204G  2.4T   8% /var/lib/ceph/osd/ceph-0
> inodes:
> /dev/sde1  183148544 55654 1830928901% /var/lib/ceph/osd/ceph-0
> ---
>
> On a more fragmented cluster I see a 5:1 size to inode ratio.
>
> I just can't fathom how there could be 3.3 million inodes (and thus a
> close number of files) using 30G, making the average file size below 10
> Bytes.
>
> Something other than your choice of file system is probably at play here.
>
> How fragmented are those SSDs?
> What's your default Ceph object size?
> Where _are_ those 3 million files in that OSD, are they actually in the
> object files like:
> -rw-r--r-- 1 root root 4194304 Jan  9 15:27 
> /var/lib/ceph/osd/ceph-0/current/3.117_head/DIR_7/DIR_1/DIR_5/rb.0.23a8f.238e1f29.00027632__head_C4F3D517__3
>
> What's your use case, RBD, CephFS, RadosGW?
>
> Regards,
>
> Christian
>
> On Mon, 23 Mar 2015 10:32:55 +0300 Kamil Kuramshin wrote:
>
>
>  Recently got a problem with OSDs based on SSD disks used in cache tier
> for EC-pool
>
> superuser@node02:~$ df -i
> FilesystemInodes   IUsed *IFree* IUse% Mounted on
> <...>
> /dev/sdb13335808 3335808 *0* 100%
> /var/lib/ceph/osd/ceph-45
> /dev/sda13335808 3335808 *0* 100%
> /var/lib/ceph/osd/ceph-46
>
> Now that OSDs are down on each ceph-node and cache tiering is not
> working.
>
> superuser@node01:~$ sudo tail /var/log/ceph/ceph-osd.45.log
> 2015-03-23 10:04:23.631137 7fb105345840  0 ceph version 0.87.1
> (283c2e7cfa2457799f534744d7d549f83ea1335e), process ceph-osd, pid 1453465
> 2015-03-23 10:04:23.640676 7fb105345840  0
> filestore(/var/lib/ceph/osd/ceph-45) backend generic (magic 0xef53)
> 2015-03-23 10:04:23.640735 7fb105345840 -1
> genericfilestorebackend(/var/lib/ceph/osd/ceph-45) detect_features:
> unable to create /var/lib/ceph/osd/ceph-45/fiemap_test: (28) No space
> left on device
> 2015-03-23 10:04:23.640763 7fb105345840 -1
> filestore(/var/lib/ceph/osd/ceph-45) _detect_fs: detect_features error:
> (28) No space left on device
> 2015-03-23 10:04:23.640772 7fb105345840 -1
> filestore(/var/lib/ceph/osd/ceph-45) FileStore::mount : error in
> _detect_fs: (28) No space left on device
> 2015-03-23 10:04:23.640783 7fb105345840 -1  ** ERROR: error converting
> store /var/lib/ceph/osd/ceph-45: (28) *No space left on device*
>
> In the same time*df -h *is confusing:
>
> superuser@node01:~$ df -h
> Filesystem  Size  Used *Avail* Use% Mounted on
> <...>
> /dev/sda150G   29G *20G*
> 60% /var/lib/ceph/osd/ceph-45 /dev/sdb150G   27G
> *21G*  56% /var/lib/ceph/osd/ceph-46
>
>
> Filesystem used on affected OSDs is EXt4. All OSDs are deployed with
> ceph-deploy:
> $ ceph-deploy osd create --zap-disk --fs-type ext4 :
>
>
> Help me out what it was just test deployment and all EC-pool data was
> lost since I /can't start OSDs/ and ceph cluster/becames degraded /until
> I removed all affected tiered pools (cache & EC)
> So this is just my observation of what kind of problems can be faced if
> you choose wrong Filesystem for OSD backend.
> And now I *strongly* recommend you to choose*XFS* or *Btrfs* filesystems
> because both are supporting dynamic inode allocation and this problem
> can't arise with them.
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS questions

2015-03-23 Thread John Spray

On 22/03/2015 08:29, Bogdan SOLGA wrote:

Hello, everyone!

I have a few questions related to the CephFS part of Ceph:

  * is it production ready?

Like it says at http://ceph.com/docs/master/cephfs/: " CephFS currently 
lacks a robust ‘fsck’ check and repair function. Please use caution when 
storing important data as the disaster recovery tools are still under 
development".  That page was recently updated.


  * can multiple CephFS be created on the same cluster? The CephFS
creation  page
describes how to create a CephFS using (at least) two pools, but
the mounting 
page does not refer to any pool, when mounting the FS;


Currently you can only have one filesystem per Ceph cluster.


  * besides the pool quota

setting, are there any means by which a CephFS can have a quota
defined? I have found this

document, which is from the Firefly release (and it seems only a
draft), but no other references on the matter.

Yes, when using the fuse client there is a per-directory quota system 
available, although it is not guaranteed to be completely strict. I 
don't think there is any documentation for that, but you can see how to 
use it here:

https://github.com/ceph/ceph/blob/master/qa/workunits/fs/quota/quota.sh
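
In short it is driven by extended attributes on directories, along these lines
(assuming a ceph-fuse mount at /mnt/cephfs):

  setfattr -n ceph.quota.max_bytes -v 10737418240 /mnt/cephfs/some_dir   # 10 GiB
  setfattr -n ceph.quota.max_files -v 100000 /mnt/cephfs/some_dir
  getfattr -n ceph.quota.max_bytes /mnt/cephfs/some_dir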


  * this  page
refers to 'mounting only a part of the namespace' -- what is the
namespace referred in the page?

In this context namespace means the filesystem tree.  So "part of the 
namespace" means a subdirectory.


  * can a CephFS be mounted simultaneously from multiple clients?


Yes.


  * what would be the recommended way of creating system users on a
CephFS, if a quota is needed for each user? create a pool for each
user? or?

No recommendation at this stage - it would be interesting for you to try 
some things and let us know how you get on.


Cheers,
John

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Uneven CPU usage on OSD nodes

2015-03-23 Thread f...@univ-lr.fr

Hi Somnath,

Thank you, please find my answers below

Somnath Roy wrote on 22/03/15 18:16:


Hi Frederick,

Need some information here.

 

1. Just to clarify, you are saying it is happening in 0.87.1 and not 
in Firefly ?


That's a possibility; others running similar hardware (and possibly OS, 
I can ask) confirm they don't see such visible behaviour on Firefly.

I'd need to install Firefly on our hosts to be sure.
We run on RHEL.


 


2. Is it happening after some hours of run or just right away ?


It's happening on freshly installed hosts and goes on.


 


3. Please provide ‘perf top’ output of all the OSD nodes.


Here they are :
http://www.4shared.com/photo/S9tvbNKEce/UnevenLoad3-perf.html
http://www.4shared.com/photo/OHfiAtXKba/UnevenLoad3-top.html

The left-hand 'high-cpu' nodes show tcmalloc calls that could explain the 
CPU difference. We don't see them on the 'low-cpu' nodes:


12,15%  libtcmalloc.so.4.1.2  [.] tcmalloc::CentralFreeList::FetchFromSpans
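
(If this really is tcmalloc thread-cache contention, one thing we may try, just a
guess at this point, is giving tcmalloc a bigger thread cache before starting the
OSDs, e.g.:

  export TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728   # 128 MiB
  # then restart the ceph-osd processes with this variable in their environment
)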


 


4. Provide the ceph.conf file from your OSD node as well.


It's a basic configuration. FSID and IP are removed

[global]
fsid = 589xa9
mon_initial_members = helga
mon_host = X.Y.Z.64
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = X.Y.0.0/16


Regards,
Frederic

 


Thanks & Regards

Somnath

 

*From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On 
Behalf Of *f...@univ-lr.fr

*Sent:* Sunday, March 22, 2015 2:15 AM
*To:* Craig Lewis
*Cc:* Ceph Users
*Subject:* Re: [ceph-users] Uneven CPU usage on OSD nodes

 


Hi Craig,

An uneven primaries distribution was indeed my first thought.
I should have been more explicit on the percentages of the histograms 
I gave, lets see them in detail in a more comprehensive way.


On 27938 bench objects seen by the osdmap, the hosts are distributed 
like this:

20904 host1
21210 host2
20835 host3
20709 host3
That's the number of times they appear (as primary, secondary or 
tertiary).
The distribution is pretty even, as we don't have more than a 0.5% 
difference in total objects between the most and the least used host.


If we now consider the primary host distribution, here is what we have:
7207 host1
6960 host2
6814 host3
6957 host3
That's the number of times each host appears as primary.
Once again, the distribution is correct, with less than 1.5% of the 
total entries between the most and the least used host as primary.
I must add that such a distribution is of course also observed for the 
secondary and the tertiary copy.


I think we have enough samples to confirm the correct distribution of 
the crush function.
Each host having a 25% chance to be primary, this shouldn't be the 
reason why we observe a higher CPU load. There must be something else.


I must add we run 0.87.1 Giant.
Going to a firefly release is an option, as the phenomenon is not currently 
observed on comparable hardware platforms running 0.80.x.
About the memory on hosts, 32GB is just a beginning for the tests. 
We'll add more later.


Frederic


Craig Lewis wrote on 20/03/15 23:19:


I would say you're a little light on RAM.  With 4TB disks 70% full, 
I've seen some ceph-osd processes using 3.5GB of RAM during recovery.  
You'll be fine during normal operation, but you might run into issues 
at the worst possible time.


 

I have 8 OSDs per node, and 32G of RAM.  I've had ceph-osd processes 
start swapping, and that's a great way to get them kicked out for 
being unresponsive.


 

 

I'm not a dev, but I can make some wild and uninformed guesses :-) .  
The primary OSD uses more CPU than the replicas, and I suspect that 
you have more primaries on the hot nodes.


 

Since you're testing, try repeating the test on 3 OSD nodes instead of 
4.  If you don't want to run that test, you can generate a histogram 
from ceph pg dump data, and see if there are more primary osds (the 
first one in the acting array) on the hot nodes.


 

 

 

On Wed, Mar 18, 2015 at 7:18 AM, f...@univ-lr.fr wrote:


Hi to the ceph-users list !

We're setting up a new Ceph infrastructure :
- 1 MDS admin node
- 4 OSD storage nodes (60 OSDs)
  each of them running a monitor
- 1 client

Each 32GB RAM/16 cores OSD node supports 15 x 4TB SAS OSDs (XFS) and 1 
SSD with 5GB journal partitions, all in JBOD attachement.

Every node has 2x10Gb LACP attachement.
The OSD nodes are freshly installed with puppet then from the admin node
Default OSD weight in the OSD tree
1 test pool with 4096 PGs

During setup phase, we're trying to qualify the performance 
characteristics of our setup.

Rados benchmark are done from a client with these commandes :
rados -p pool -b 4194304 bench 60 write -t 32 --no-cleanup
rados -p pool -b 4194304 bench 60 seq -t 32 --no-cleanup

Each time we observed a recurring phenomenon: 2 of the 4 OSD nodes 
have twice the CPU load:

http

[ceph-users] PG calculator queries

2015-03-23 Thread Sreenath BH
Hi,

consider following values for a pool:

Size = 3
OSDs = 400
%Data = 100
Target PGs per OSD = 200 (This is default)

The PG calculator generates the number of PGs for this pool as: 32768.
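
That presumably comes from the usual formula, rounded up to the next power of two:

  (Target PGs per OSD x #OSDs x %Data) / Size = (200 x 400 x 1.0) / 3 ~= 26667
  next power of two >= 26667                  = 32768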

Questions:

1. The Ceph documentation recommends around 100 PGs/OSD, whereas the
calculator takes 200 as default value. Are there any changes in the
recommended value of PGs/OSD?

2. Under "notes" it says:
"Total PG Count" below table will be the count of Primary PG copies.
However, when calculating total PGs per OSD average, you must include
all copies.

However, the number of 200 PGs/OSD already seems to include the
primary as well as the replica PGs on an OSD. Is the note a typo or
am I missing something?

thanks,
Sreenath
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Issue with free Inodes

2015-03-23 Thread Kamil Kuramshin

Yes, I understand that.

The initial purpose of my first email was just advice for newcomers. My 
fault was that I selected ext4 as the backend for the SSD disks,
but I did not foresee that the inode count can reach its limit before the 
free space does :)


And maybe there should be some sort of warning not only for free space in 
MiB (GiB, TiB) but also a dedicated warning about free inodes 
for filesystems with static inode allocation like ext4,
because if an OSD reaches the inode limit it becomes totally unusable and 
immediately goes down, and from that moment there is no way to start it!



On 23.03.2015 13:42, Thomas Foster wrote:
You could fix this by changing your block size when formatting the 
mount-point with the mkfs -b command.  I had this same issue when 
dealing with the filesystem using glusterfs and the solution is to 
either use a filesystem that allocates inodes automatically or change 
the block size when you build the filesystem.  Unfortunately, the only 
way to fix the problem that I have seen is to reformat


On Mon, Mar 23, 2015 at 5:51 AM, Kamil Kuramshin 
mailto:kamil.kurams...@tatar.ru>> wrote:


In my case there was cache pool for ec-pool serving RBD-images,
and object size is 4Mb, and client was an /kernel-rbd /client
each SSD disk is 60G disk, 2 disk per node,  6 nodes in total = 12
OSDs in total


On 23.03.2015 12:00, Christian Balzer wrote:

Hello,

This is rather confusing, as cache-tiers are just normal OSDs/pools and
thus should have Ceph objects of around 4MB in size by default.

This is matched on what I see with Ext4 here (normal OSD, not a cache
tier):
---
size:
/dev/sde1   2.7T  204G  2.4T   8% /var/lib/ceph/osd/ceph-0
inodes:
/dev/sde1  183148544 55654 1830928901% /var/lib/ceph/osd/ceph-0
---

On a more fragmented cluster I see a 5:1 size to inode ratio.

I just can't fathom how there could be 3.3 million inodes (and thus a
close number of files) using 30G, making the average file size below 10
Bytes.

Something other than your choice of file system is probably at play here.

How fragmented are those SSDs?
What's your default Ceph object size?
Where _are_ those 3 million files in that OSD, are they actually in the
object files like:
-rw-r--r-- 1 root root 4194304 Jan  9 15:27 
/var/lib/ceph/osd/ceph-0/current/3.117_head/DIR_7/DIR_1/DIR_5/rb.0.23a8f.238e1f29.00027632__head_C4F3D517__3

What's your use case, RBD, CephFS, RadosGW?

Regards,

Christian

On Mon, 23 Mar 2015 10:32:55 +0300 Kamil Kuramshin wrote:


Recently got a problem with OSDs based on SSD disks used in cache tier
for EC-pool

superuser@node02:~$ df -i
FilesystemInodes   IUsed *IFree* IUse% Mounted on
<...>
/dev/sdb13335808 3335808 *0* 100%
/var/lib/ceph/osd/ceph-45
/dev/sda13335808 3335808 *0* 100%
/var/lib/ceph/osd/ceph-46

Now that OSDs are down on each ceph-node and cache tiering is not
working.

superuser@node01:~$ sudo tail /var/log/ceph/ceph-osd.45.log
2015-03-23 10:04:23.631137 7fb105345840  0 ceph version 0.87.1
(283c2e7cfa2457799f534744d7d549f83ea1335e), process ceph-osd, pid 1453465
2015-03-23 10:04:23.640676 7fb105345840  0
filestore(/var/lib/ceph/osd/ceph-45) backend generic (magic 0xef53)
2015-03-23 10:04:23.640735 7fb105345840 -1
genericfilestorebackend(/var/lib/ceph/osd/ceph-45) detect_features:
unable to create /var/lib/ceph/osd/ceph-45/fiemap_test: (28) No space
left on device
2015-03-23 10:04:23.640763 7fb105345840 -1
filestore(/var/lib/ceph/osd/ceph-45) _detect_fs: detect_features error:
(28) No space left on device
2015-03-23 10:04:23.640772 7fb105345840 -1
filestore(/var/lib/ceph/osd/ceph-45) FileStore::mount : error in
_detect_fs: (28) No space left on device
2015-03-23 10:04:23.640783 7fb105345840 -1  ** ERROR: error converting
store /var/lib/ceph/osd/ceph-45: (28) *No space left on device*

In the same time*df -h *is confusing:

superuser@node01:~$ df -h
Filesystem  Size  Used *Avail* Use% Mounted on
<...>
/dev/sda150G   29G *20G*
60% /var/lib/ceph/osd/ceph-45 /dev/sdb150G   27G
*21G*  56% /var/lib/ceph/osd/ceph-46


Filesystem used on affected OSDs is EXt4. All OSDs are deployed with
ceph-deploy:
$ ceph-deploy osd create --zap-disk --fs-type ext4 :


Help me out what it was just test deployment and all EC-pool data was
lost since I /can't start OSDs/ and ceph cluster/becames degraded /until
I removed all affected tiered pools (cache & EC)
So this is just my observation of what kind of problems can be faced if
you choose wrong Filesystem for OSD backend.
And now I *strongly* recommend you to choose*XFS* or *Btrfs* filesystems
because both are supporting dynamic inode

Re: [ceph-users] arm cluster install

2015-03-23 Thread Yann Dupont - Veille Techno

On 22/03/2015 22:44, hp cre wrote:


Hello Yann,

Thanks for your reply. Unfortunately,  I found it by chance during a 
search, since you didn't include me in the reply, I never got it on my 
email.




Well that wasn't intended, but that's because I replied to the list, 
which is usually the way I do.


I am interested in what you mentioned so far. I'm not looking into 
making any production grade cluster,  just a couple of nodes for 
testing ceph and its failure scenarios.


Current ubuntu and Debian based distributions for Banana pro are based 
on kernel 3.4.103. I see you used a more recent kernel,  did you get 
it ready made or you compiled it yourself ?




I compiled it myself. Since the 3.18 kernel, the upstreaming efforts of the sunxi 
community are paying off (thanks to them), and we now have all the pieces 
for complete vanilla kernel support, at least for all 
server-relevant components.


I actually have the choice now of attaching the osd disks via either 
sata or usb. I'm buying those Chinese 16GB ssd disks. They are good for 
i/o but not for write speed.




Well as I said, Sata port on A20 is currently limited regarding write 
speed anyway. So depending on your workload, you may not find a big 
difference really.


Cheers,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph cache tier

2015-03-23 Thread Yujian Peng
Hi all,

I have a ceph cluster (0.80.7) in production.
Now I have hit an IOPS bottleneck, so I want to add a cache 
tier with SSDs to provide better I/O performance. Here is the procedure:
1. Create a cache pool
2. Set up a cache tier
ceph osd tier add cold-storage hot-storage
3. Set cache mode
ceph osd tier cache-mode hot-storage writeback
4. Direct all client traffic from the storage pool to the cache pool
ceph osd tier set-overlay cold-storage hot-storage
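
(I assume the cache pool will also need hit-set and sizing parameters before the
overlay is set, something like the following, with values still to be tuned:

  ceph osd pool set hot-storage hit_set_type bloom
  ceph osd pool set hot-storage hit_set_count 1
  ceph osd pool set hot-storage hit_set_period 3600
  ceph osd pool set hot-storage target_max_bytes 5000000000000
  ceph osd pool set hot-storage cache_target_dirty_ratio 0.4
  ceph osd pool set hot-storage cache_target_full_ratio 0.8
)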

There are about 1000 VMs (kvm with rbd). 
Can I do this on the fly without stopping any VM?
By the way, all the data is about 100T; is a 5T cache tier enough?

Any help, is highly appreciated.

Thanks!

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Rados Gateway and keystone

2015-03-23 Thread ghislain.chevalier
Hi All,

I just want to be sure about the keystone configuration for the Rados Gateway.

I read the documentation http://ceph.com/docs/master/radosgw/keystone/ and 
http://ceph.com/docs/master/radosgw/config-ref/?highlight=keystone
but I didn't catch whether, after having configured the rados gateway (ceph.conf) to 
use keystone, it becomes mandatory to create all the users in it.

In other words, can an rgw be, at the same time, under keystone control and 
managed with the standard radosgw-admin ?
How does it work for S3 users ?

What is the purpose of the "rgw s3 auth use keystone" parameter ?

Best regards

- - - - - - - - - - - - - - - - -
Ghislain Chevalier
+33299124432
+33788624370
ghislain.cheval...@orange.com

_

This message and its attachments may contain confidential or privileged 
information that may be protected by law;
they should not be distributed, used or copied without authorisation.
If you have received this email in error, please notify the sender and delete 
this message and its attachments.
As emails may be altered, Orange is not liable for messages that have been 
modified, changed or falsified.
Thank you.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] More writes on filestore than on journal ?

2015-03-23 Thread Olivier Bonvalet
Hi,

I'm still trying to find out why there are many more write operations on
the filestore since Emperor/Firefly than with Dumpling.

So, I added monitoring of all the perf counter values from the OSD.

From what I see: «filestore.ops» reports an average of 78 operations
per second. But block device monitoring reports an average of 113
operations per second (+45%).
Please see those 2 graphs:
- https://daevel.fr/img/firefly/osd-70.filestore-ops.png
- https://daevel.fr/img/firefly/osd-70.sda-ops.png
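For reference, these counters come from the OSD admin socket and can be polled
with something like the following (socket path and OSD id are examples):

while sleep 60; do
    date +%s
    ceph --admin-daemon /var/run/ceph/ceph-osd.70.asok perf dump | \
        python -mjson.tool | grep -E '"(ops|bytes|journal_ops|journal_bytes)"'
done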

Do you see what can explain this difference? (This OSD uses XFS.)

Thanks,
Olivier

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] More writes on blockdevice than on filestore ?

2015-03-23 Thread Olivier Bonvalet
Erg... I sent too fast. Bad title, please read «More writes on
blockdevice than on filestore».


Le lundi 23 mars 2015 à 14:21 +0100, Olivier Bonvalet a écrit :
> Hi,
> 
> I'm still trying to find why there is much more write operations on
> filestore since Emperor/Firefly than from Dumpling.
> 
> So, I add monitoring of all perf counters values from OSD.
> 
> From what I see : «filestore.ops» reports an average of 78 operations
> per seconds. But, block device monitoring reports an average of 113
> operations per seconds (+45%).
> please thoses 2 graphs :
> - https://daevel.fr/img/firefly/osd-70.filestore-ops.png
> - https://daevel.fr/img/firefly/osd-70.sda-ops.png
> 
> Do you see what can explain this difference ? (this OSD use XFS)
> 
> Thanks,
> Olivier
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] add stop_scrub command for ceph

2015-03-23 Thread Sage Weil
On Mon, 23 Mar 2015, Xinze Chi wrote:
> hi ceph:
> 
> Currently, there is no command which can stop scrubbing when
> the pg is doing a scrub or deep
> scrub. What about adding a command to support it? I think this is very
> useful for system administrators.
> 
>  I have added an issue to track it: http://tracker.ceph.com/issues/11202.

This sounds like the right solution to me.

I think the main thing we need in order to ensure this works reliably is 
to modify ceph-qa-suite.git/tasks/thrashosds.py and/or ceph_manager.py so 
that it periodically sets/unsets the noscrub flags so that we exercise the 
new cancellation code path.  Do you mind preparing a patch for that too?
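For anyone following along, the flags themselves can already be toggled by
hand; the thrasher change would boil down to periodically doing the equivalent
of this shell sketch (the real patch belongs in the Python tasks named above):

while true; do
    ceph osd set noscrub
    ceph osd set nodeep-scrub
    sleep $((RANDOM % 120 + 30))
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub
    sleep $((RANDOM % 120 + 30))
done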

Thanks!
sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph in Production: best practice to monitor OSD up/down status

2015-03-23 Thread Gregory Farnum
On Sun, Mar 22, 2015 at 2:55 AM, Saverio Proto  wrote:
> Hello,
>
> I started to work with CEPH few weeks ago, I might ask a very newbie
> question, but I could not find an answer in the docs or in the ml
> archive for this.
>
> Quick description of my setup:
> I have a ceph cluster with two servers. Each server has 3 SSD drives I
> use for journal only. To map to different failure domains SAS disks
> that keep a journal to the same SSD drive, I wrote my own crushmap.
> I have now a total of 36OSD. Ceph health returns HEALTH_OK.
> I run the cluster with a couple of pools with size=3 and min_size=3
>
>
> Production operations questions:
> I manually stopped some OSDs to simulate a failure.
>
> As far as I understood, an "OSD down" condition is not enough to make
> CEPH start making new copies of objects. I noticed that I must mark
> the OSD as "out" to make ceph produce new copies.
> As far as I understood min_size=3 puts the object in readonly if there
> are not at least 3 copies of the object available.

That is correct, but the default with size 3 is 2 and you probably
want to do that instead. If you have size==min_size on firefly
releases and lose an OSD it can't do recovery so that PG is stuck
without manual intervention. :( This is because of some quirks about
how the OSD peering and recovery works, so you'd be forgiven for
thinking it would recover nicely.
(This is changed in the upcoming Hammer release, but you probably
still want to allow cluster activity when an OSD fails, unless you're
very confident in their uptime and more concerned about durability
than availability.)
-Greg

>
> Is this behavior correct or I made some mistake creating the cluster ?
> Should I expect ceph to produce automatically a new copy for objects
> when some OSDs are down ?
> There is any option to mark automatically "out" OSDs that go "down" ?
>
> thanks
>
> Saverio
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Hardware recommendation

2015-03-23 Thread Christian Balzer
On Mon, 23 Mar 2015 11:51:56 +0100 (CET) Alexandre DERUMIER wrote:

> >> the combination of all the
> >>things mentioned before in the Ceph/FS stack caused a 12x amplification
> >>(instead of 2x) _before_ hitting the SSD.
> 
> oh, ok, pretty strange.
> 
>  BTW, is it through ceph-fs ? or rbd/rados ?
> 
See the link below, it was rados bench.
But anything that would generate small writes would cause this, I bet.


> - Mail original -
> De: "Christian Balzer" 
> À: "ceph-users" 
> Cc: "aderumier" 
> Envoyé: Lundi 23 Mars 2015 08:29:03
> Objet: Re: [ceph-users] SSD Hardware recommendation
> 
> Hello, 
> 
> Again refer to my original, old mail: 
> 
> http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043949.html
>  
> 
> I was strictly looking at the SMART values, in the case of these 
> Intel DC S3700 SSDs the "Host_Writes_32MiB" values. 
> Which, according to what name implies and what all the references I
> could find means exactly that, the writes from the host (the SATA
> controller) to the actual SSD. 
> So no matter what optimizations the SSD does itself and what other
> things might be possible with things like O_DSYNC, the combination of
> all the things mentioned before in the Ceph/FS stack caused a 12x
> amplification (instead of 2x) _before_ hitting the SSD. 
> 
> And that's where optimizations in Ceph and other components, maybe 
> avoiding a FS altogether, will be very helpful and welcome. 
> 
> Regards, 
> 
> Christian 
> 
> On Mon, 23 Mar 2015 07:49:41 +0100 (CET) Alexandre DERUMIER wrote: 
> 
> > >>(not tested, but I think with journal and O_DSYNC writes, it can
> > >>give use ssd write amplification) 
> > 
> > also, I think that enterprise ssd with supercapacitor, should be able
> > to cache theses o_dsync writes in the ssd buffer, and do bigger writes
> > to reduce amplification. 
> > 
> > Don't known how ssd internal algorithms work for this. 
> > 
> > 
> > - Mail original - 
> > De: "aderumier"  
> > À: "Christian Balzer"  
> > Cc: "ceph-users"  
> > Envoyé: Lundi 23 Mars 2015 07:36:48 
> > Objet: Re: [ceph-users] SSD Hardware recommendation 
> > 
> > Hi, 
> > 
> > Isn't it in the nature of ssd to have write amplication ? 
> > 
> > Generaly, they have a erase block size of 128k, 
> > 
> > so the worst case could be 128/4 = 32x write amplification. 
> > 
> > (of course ssd algorithms and optimisations reduce this write 
> > amplification). 
> > 
> > Now, it could be great to see if it's coming from osd journal or osd 
> > datas. 
> > 
> > (not tested, but I think with journal and O_DSYNC writes, it can give 
> > use ssd write amplification) 
> > 
> > 
> > - Mail original - 
> > De: "Christian Balzer"  
> > À: "ceph-users"  
> > Envoyé: Lundi 23 Mars 2015 03:11:39 
> > Objet: Re: [ceph-users] SSD Hardware recommendation 
> > 
> > On Mon, 23 Mar 2015 02:33:20 +0100 Francois Lafont wrote: 
> > 
> > > Hi, 
> > > 
> > > Sorry Christian for my late answer. I was a little busy. 
> > > 
> > > Christian Balzer a wrote: 
> > > 
> > > > You're asking the wrong person, as I'm neither a Ceph or kernel 
> > > > developer. ^o^ 
> > > 
> > > No, no, the rest of the message proves to me that I talk to the 
> > > right person. ;) 
> > > 
> > > > Back then Mark Nelson from the Ceph team didn't expect to see
> > > > those numbers as well, but both Mark Wu and I saw them. 
> > > > 
> > > > Anyways, lets start with the basics and things that are 
> > > > understandable without any detail knowledge. 
> > > > 
> > > > Assume a cluster with 2 nodes, 10 OSDs each and a replication of 2 
> > > > (Since we're talking about SSD cluster here and keep things
> > > > related to the question of the OP). 
> > > > 
> > > > Now a client writes 40MB of data to the cluster. 
> > > > Assuming an ideal scenario where all PGs are evenly distributed 
> > > > (they won't be) and this is totally fresh data (resulting in 10
> > > > 4MB Ceph objects), this would mean that each OSD will receive 4MB
> > > > (10 primary PGs, 10 secondary ones). 
> > > > With journals on the same SSD (currently the best way based on 
> > > > tests), we get a write amplification of 2, as that data is written 
> > > > both to the journal and the actual storage space. 
> > > > 
> > > > But as my results in the link above showed, that is very much 
> > > > dependent on the write size. With a 4MB block size (the ideal size 
> > > > for default RBD pools and objects) I saw even slightly less than
> > > > the 2x amplifications expected, I assume that was due to caching
> > > > and PG imbalances. 
> > > > 
> > > > Now my guess what happens with small (4KB) writes is that all
> > > > these small writes do not coalesce sufficiently before being
> > > > written to the object on the OSD. 
> > > > So up to 1000 4KB writes could happen to that 4MB object (clearly
> > > > is it much less than that, but how much I can't tell), resulting
> > > > in the same "blocks" being rewritten several times. 
> > > 
> > > Ok, If understand w

Re: [ceph-users] Uneven CPU usage on OSD nodes

2015-03-23 Thread Gregory Farnum
On Mon, Mar 23, 2015 at 4:31 AM, f...@univ-lr.fr  wrote:
> Hi Somnath,
>
> Thank you, please find my answers below
>
> Somnath Roy  a écrit le 22/03/15 18:16 :
>
> Hi Frederick,
>
> Need some information here.
>
>
>
> 1. Just to clarify, you are saying it is happening g in 0.87.1 and not in
> Firefly ?
>
> That's a possibility, others running similar hardware (and possibly OS, I
> can ask) confirm they dont have such visible comportment on Firefly.
> I'd need to install Firefly on our hosts to be sure.
> We run on RHEL.
>
>
>
> 2. Is it happening after some hours of run or just right away ?
>
> It's happening on freshly installed hosts and goes on.
>
>
>
> 3. Please provide ‘perf top’ output of all the OSD nodes.
>
> Here they are :
> http://www.4shared.com/photo/S9tvbNKEce/UnevenLoad3-perf.html
> http://www.4shared.com/photo/OHfiAtXKba/UnevenLoad3-top.html
>
> The left-hand 'high-cpu' nodes have tmalloc calls able to explain the cpu
> difference. We don't see them on 'low-cpu' nodes :
>
> 12,15%  libtcmalloc.so.4.1.2  [.]
> tcmalloc::CentralFreeList::FetchFromSpans

Huh. The tcmalloc (memory allocator) workload should be roughly the
same across all nodes, especially if they have equivalent
distributions of PGs and primariness as you describe. Are you sure
this is a persistent CPU imbalance or are they oscillating? Are there
other processes on some of the nodes which could be requiring memory
from the system?

Either you've found a new bug in our memory allocator or something
else is going on in the system to make it behave differently across
your nodes.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Issue with free Inodes

2015-03-23 Thread Christian Balzer
On Mon, 23 Mar 2015 15:26:07 +0300 Kamil Kuramshin wrote:

> Yes, I understand that.
> 
> The initial purpose of my first email was just a piece of advice for newcomers. My 
> fault was that I selected ext4 as the backend for the SSD disks.
> But I did not foresee that the inode count can reach its limit before the 
> free space :)
> 
> And maybe there should be a warning not only for free space in 
> MiB (GiB, TiB), but also a dedicated warning about free inodes 
> for filesystems with static inode allocation like ext4.
> Because if an OSD reaches the inode limit it becomes totally unusable and 
> immediately goes down, and from that moment there is no way to start it!
> 
While all that is true and should probably be addressed, please re-read
what I wrote before.

With the 3.3 million inodes used and thus likely as many files (did you
verify this?) and 4MB objects that would make something in the 12TB
ballpark area.

Something very very strange and wrong is going on with your cache tier.

Christian

> 
> 23.03.2015 13:42, Thomas Foster пишет:
> > You could fix this by changing your block size when formatting the 
> > mount-point with the mkfs -b command.  I had this same issue when 
> > dealing with the filesystem using glusterfs and the solution is to 
> > either use a filesystem that allocates inodes automatically or change 
> > the block size when you build the filesystem.  Unfortunately, the only 
> > way to fix the problem that I have seen is to reformat
> >
> > On Mon, Mar 23, 2015 at 5:51 AM, Kamil Kuramshin 
> > mailto:kamil.kurams...@tatar.ru>> wrote:
> >
> > In my case there was cache pool for ec-pool serving RBD-images,
> > and object size is 4Mb, and client was an /kernel-rbd /client
> > each SSD disk is 60G disk, 2 disk per node,  6 nodes in total = 12
> > OSDs in total
> >
> >
> > 23.03.2015 12:00, Christian Balzer пишет:
> >> Hello,
> >>
> >> This is rather confusing, as cache-tiers are just normal
> >> OSDs/pools and thus should have Ceph objects of around 4MB in size by
> >> default.
> >>
> >> This is matched on what I see with Ext4 here (normal OSD, not a
> >> cache tier):
> >> ---
> >> size:
> >> /dev/sde1   2.7T  204G  2.4T   8% /var/lib/ceph/osd/ceph-0
> >> inodes:
> >> /dev/sde1  183148544 55654 183092890
> >> 1% /var/lib/ceph/osd/ceph-0 ---
> >>
> >> On a more fragmented cluster I see a 5:1 size to inode ratio.
> >>
> >> I just can't fathom how there could be 3.3 million inodes (and
> >> thus a close number of files) using 30G, making the average file size
> >> below 10 Bytes.
> >>
> >> Something other than your choice of file system is probably at
> >> play here.
> >>
> >> How fragmented are those SSDs?
> >> What's your default Ceph object size?
> >> Where _are_ those 3 million files in that OSD, are they actually
> >> in the object files like:
> >> -rw-r--r-- 1 root root 4194304 Jan  9
> >> 15:27 
> >> /var/lib/ceph/osd/ceph-0/current/3.117_head/DIR_7/DIR_1/DIR_5/rb.0.23a8f.238e1f29.00027632__head_C4F3D517__3
> >>
> >> What's your use case, RBD, CephFS, RadosGW?
> >>
> >> Regards,
> >>
> >> Christian
> >>
> >> On Mon, 23 Mar 2015 10:32:55 +0300 Kamil Kuramshin wrote:
> >>
> >>> Recently got a problem with OSDs based on SSD disks used in
> >>> cache tier for EC-pool
> >>>
> >>> superuser@node02:~$ df -i
> >>> Filesystem   Inodes    IUsed     IFree  IUse%  Mounted on
> >>> <...>
> >>> /dev/sdb1    3335808   3335808   0      100%   /var/lib/ceph/osd/ceph-45
> >>> /dev/sda1    3335808   3335808   0      100%   /var/lib/ceph/osd/ceph-46
> >>>
> >>> Now that OSDs are down on each ceph-node and cache tiering is not
> >>> working.
> >>>
> >>> superuser@node01:~$ sudo tail /var/log/ceph/ceph-osd.45.log
> >>> 2015-03-23 10:04:23.631137 7fb105345840  0 ceph version 0.87.1
> >>> (283c2e7cfa2457799f534744d7d549f83ea1335e), process ceph-osd,
> >>> pid 1453465 2015-03-23 10:04:23.640676 7fb105345840  0
> >>> filestore(/var/lib/ceph/osd/ceph-45) backend generic (magic
> >>> 0xef53) 2015-03-23 10:04:23.640735 7fb105345840 -1
> >>> genericfilestorebackend(/var/lib/ceph/osd/ceph-45)
> >>> detect_features: unable to
> >>> create /var/lib/ceph/osd/ceph-45/fiemap_test: (28) No space left on
> >>> device 2015-03-23 10:04:23.640763 7fb105345840 -1
> >>> filestore(/var/lib/ceph/osd/ceph-45) _detect_fs: detect_features
> >>> error: (28) No space left on device
> >>> 2015-03-23 10:04:23.640772 7fb105345840 -1
> >>> filestore(/var/lib/ceph/osd/ceph-45) FileStore::mount : error in
> >>> _detect_fs: (28) No space left on device
> >>> 2015-03-23 10:04:23.640783 7fb105345840 -1  ** ERROR: error
> >>> converting store /var/lib/ceph/osd/ceph-45: (28) *No space left on
> >>> device*
> >>>
> >>> At the same time *df -h* is confusing:
> >>>
> >>> superuser@node01:~$ df -h
> >>> Filesystem 

Re: [ceph-users] How does crush select different osds using hash(pg) in different iterations

2015-03-23 Thread Gregory Farnum
On Sat, Mar 21, 2015 at 10:46 AM, shylesh kumar  wrote:
> Hi ,
>
> I was going through this simplified crush algorithm given in ceph website.
>
> def crush(pg):
>all_osds = ['osd.0', 'osd.1', 'osd.2', ...]
>result = []
># size is the number of copies; primary+replicas
>while len(result) < size:
>--> r = hash(pg)
>chosen = all_osds[ r % len(all_osds) ]
>if chosen in result:
># OSD can be picked only once
>continue
>result.append(chosen)
>return result
>
> In the line where r = hash(pg), will it give the same hash value in every
> iteration?
> If that is the case, we always end up choosing the same osd from the list.
> Or will the pg number be used as a seed for the hashing, so that the r value
> changes in the next iteration?
>
> Am I missing something really basic ??
> Can somebody please provide me some pointers ?

I'm not sure where this bit of documentation came from, but the
selection process includes the "attempt" number as one of the inputs;
the attempt starts at 0 (or 1, I dunno) and increments each time
we try to map a new OSD to the PG.
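In other words, the simplified pseudocode would look more like the sketch
below. This is an illustration only (the real CRUSH implementation uses its
own hash function and descends a weighted bucket hierarchy):

def crush(pg, size=3, all_osds=('osd.0', 'osd.1', 'osd.2', 'osd.3', 'osd.4')):
    result = []
    attempt = 0
    while len(result) < size:
        # the attempt number is mixed into the hash input, so the same pg
        # can map to a different OSD on each iteration
        r = hash((pg, attempt))
        attempt += 1
        chosen = all_osds[r % len(all_osds)]
        if chosen in result:
            continue  # an OSD can be picked only once
        result.append(chosen)
    return result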
-Greg

>
>
>
> --
> Thanks,
> Shylesh Kumar M
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Can't Start OSD

2015-03-23 Thread Gregory Farnum
On Sun, Mar 22, 2015 at 11:22 AM, Somnath Roy  wrote:
> You should be having replicated copies on other OSDs (disks), so, no need to 
> worry about the data loss. You add a new drive and follow the steps in the 
> following link (either 1 or 2)

Except that's not the case if you only had one copy of the PG, as
seems to be indicated by the "last acting[1]" output all over that
health warning. :/
You certainly should have a copy of the data elsewhere, but that
message means you *didn't*; presumably you had 2 copies of everything
and either your CRUSH map was bad (which should have provoked lots of
warnings?) or you've lost more than one OSD.
-Greg

>
> 1. For manual deployment, 
> http://ceph.com/docs/master/rados/operations/add-or-rm-osds/
>
> 2. With ceph-deploy, 
> http://ceph.com/docs/master/rados/deployment/ceph-deploy-osd/
>
> After successful deployment, rebalancing should start and eventually cluster 
> will come to healthy state.
>
> Thanks & Regards
> Somnath
>
>
> -Original Message-
> From: Noah Mehl [mailto:noahm...@combinedpublic.com]
> Sent: Sunday, March 22, 2015 11:15 AM
> To: Somnath Roy
> Cc: ceph-users@lists.ceph.com
> Subject: Re: Can't Start OSD
>
> Somnath,
>
> You are correct, there are dmesg errors about the drive.  How can I replace 
> the drive?  Can I copy all of the readable contents from this drive to a new 
> one?  Because I have the following output from “ceph health detail”
>
> HEALTH_WARN 43 pgs stale; 43 pgs stuck stale pg 7.5b7 is stuck stale for 
> 5954121.993990, current state stale+active+clean, last acting [1] pg 7.42a is 
> stuck stale for 5954121.993885, current state stale+active+clean, last acting 
> [1] pg 7.669 is stuck stale for 5954121.994072, current state 
> stale+active+clean, last acting [1] pg 7.121 is stuck stale for 
> 5954121.993586, current state stale+active+clean, last acting [1] pg 7.4ec is 
> stuck stale for 5954121.993956, current state stale+active+clean, last acting 
> [1] pg 7.1e4 is stuck stale for 5954121.993670, current state 
> stale+active+clean, last acting [1] pg 7.41f is stuck stale for 
> 5954121.993901, current state stale+active+clean, last acting [1] pg 7.59f is 
> stuck stale for 5954121.994024, current state stale+active+clean, last acting 
> [1] pg 7.39 is stuck stale for 5954121.993490, current state 
> stale+active+clean, last acting [1] pg 7.584 is stuck stale for 
> 5954121.994026, current state stale+active+clean, last acting [1] pg 7.fd is 
> stuck stale for 5954121.993600, current state stale+active+clean, last acting 
> [1] pg 7.6fd is stuck stale for 5954121.994158, current state 
> stale+active+clean, last acting [1] pg 7.4b5 is stuck stale for 
> 5954121.993975, current state stale+active+clean, last acting [1] pg 7.328 is 
> stuck stale for 5954121.993840, current state stale+active+clean, last acting 
> [1] pg 7.4a9 is stuck stale for 5954121.993981, current state 
> stale+active+clean, last acting [1] pg 7.569 is stuck stale for 
> 5954121.994046, current state stale+active+clean, last acting [1] pg 7.629 is 
> stuck stale for 5954121.994119, current state stale+active+clean, last acting 
> [1] pg 7.623 is stuck stale for 5954121.994118, current state 
> stale+active+clean, last acting [1] pg 7.6dd is stuck stale for 
> 5954121.994179, current state stale+active+clean, last acting [1] pg 7.3d5 is 
> stuck stale for 5954121.993935, current state stale+active+clean, last acting 
> [1] pg 7.54b is stuck stale for 5954121.994058, current state 
> stale+active+clean, last acting [1] pg 7.3cf is stuck stale for 
> 5954121.993938, current state stale+active+clean, last acting [1] pg 7.c4 is 
> stuck stale for 5954121.993633, current state stale+active+clean, last acting 
> [1] pg 7.178 is stuck stale for 5954121.993719, current state 
> stale+active+clean, last acting [1] pg 7.3b8 is stuck stale for 
> 5954121.993946, current state stale+active+clean, last acting [1] pg 7.b1 is 
> stuck stale for 5954121.993635, current state stale+active+clean, last acting 
> [1] pg 7.5fb is stuck stale for 5954121.994146, current state 
> stale+active+clean, last acting [1] pg 7.236 is stuck stale for 
> 5954121.993801, current state stale+active+clean, last acting [1] pg 7.2f5 is 
> stuck stale for 5954121.993881, current state stale+active+clean, last acting 
> [1] pg 7.ac is stuck stale for 5954121.993643, current state 
> stale+active+clean, last acting [1] pg 7.16d is stuck stale for 
> 5954121.993738, current state stale+active+clean, last acting [1] pg 7.6b7 is 
> stuck stale for 5954121.994223, current state stale+active+clean, last acting 
> [1] pg 7.5ea is stuck stale for 5954121.994166, current state 
> stale+active+clean, last acting [1] pg 7.a3 is stuck stale for 
> 5954121.993654, current state stale+active+clean, last acting [1] pg 7.52d is 
> stuck stale for 5954121.994110, current state stale+active+clean, last acting 
> [1] pg 7.2d8 is stuck stale for 5954121.993904, current state 
> stale+active+clean, last acti

Re: [ceph-users] Ceph in Production: best practice to monitor OSD up/down status

2015-03-23 Thread Saverio Proto
Hello,

thanks for the answers.

This was exactly what I was looking for:

mon_osd_down_out_interval = 900

I was not waiting long enough to see my cluster recover by itself.
That's why I tried to increase min_size, because I did not understand
what min_size was for.

Now that I know what min_size is, I guess the best setting for me is
min_size = 1, because I would like to be able to do I/O operations
even if only 1 copy is left.

Thanks to all for helping !

Saverio



2015-03-23 14:58 GMT+01:00 Gregory Farnum :
> On Sun, Mar 22, 2015 at 2:55 AM, Saverio Proto  wrote:
>> Hello,
>>
>> I started to work with CEPH few weeks ago, I might ask a very newbie
>> question, but I could not find an answer in the docs or in the ml
>> archive for this.
>>
>> Quick description of my setup:
>> I have a ceph cluster with two servers. Each server has 3 SSD drives I
>> use for journal only. To map to different failure domains SAS disks
>> that keep a journal to the same SSD drive, I wrote my own crushmap.
>> I have now a total of 36OSD. Ceph health returns HEALTH_OK.
>> I run the cluster with a couple of pools with size=3 and min_size=3
>>
>>
>> Production operations questions:
>> I manually stopped some OSDs to simulate a failure.
>>
>> As far as I understood, an "OSD down" condition is not enough to make
>> CEPH start making new copies of objects. I noticed that I must mark
>> the OSD as "out" to make ceph produce new copies.
>> As far as I understood min_size=3 puts the object in readonly if there
>> are not at least 3 copies of the object available.
>
> That is correct, but the default with size 3 is 2 and you probably
> want to do that instead. If you have size==min_size on firefly
> releases and lose an OSD it can't do recovery so that PG is stuck
> without manual intervention. :( This is because of some quirks about
> how the OSD peering and recovery works, so you'd be forgiven for
> thinking it would recover nicely.
> (This is changed in the upcoming Hammer release, but you probably
> still want to allow cluster activity when an OSD fails, unless you're
> very confident in their uptime and more concerned about durability
> than availability.)
> -Greg
>
>>
>> Is this behavior correct or I made some mistake creating the cluster ?
>> Should I expect ceph to produce automatically a new copy for objects
>> when some OSDs are down ?
>> There is any option to mark automatically "out" OSDs that go "down" ?
>>
>> thanks
>>
>> Saverio
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph in Production: best practice to monitor OSD up/down status

2015-03-23 Thread Gregory Farnum
On Mon, Mar 23, 2015 at 7:17 AM, Saverio Proto  wrote:
> Hello,
>
> thanks for the answers.
>
> This was exacly what I was looking for:
>
> mon_osd_down_out_interval = 900
>
> I was not waiting long enoght to see my cluster recovering by itself.
> That's why I tried to increase min_size, because I did not understand
> what min_size was for.
>
> Now that I know what is min_size, I guess the best setting for me is
> min_size = 1 because I would like to be able to make I/O operations
> even of only 1 copy is left.

I'd strongly recommend leaving it at two — if you reduce it to 1 then
you can lose data by having just one disk die at an inopportune
moment, whereas if you leave it at 2 the system won't accept any
writes to only one hard drive. Leaving it at two the system will still
try and re-replicate back up to three copies after "mon osd down out
interval" time has elapsed from a failure. :)
-Greg

>
> Thanks to all for helping !
>
> Saverio
>
>
>
> 2015-03-23 14:58 GMT+01:00 Gregory Farnum :
>> On Sun, Mar 22, 2015 at 2:55 AM, Saverio Proto  wrote:
>>> Hello,
>>>
>>> I started to work with CEPH few weeks ago, I might ask a very newbie
>>> question, but I could not find an answer in the docs or in the ml
>>> archive for this.
>>>
>>> Quick description of my setup:
>>> I have a ceph cluster with two servers. Each server has 3 SSD drives I
>>> use for journal only. To map to different failure domains SAS disks
>>> that keep a journal to the same SSD drive, I wrote my own crushmap.
>>> I have now a total of 36OSD. Ceph health returns HEALTH_OK.
>>> I run the cluster with a couple of pools with size=3 and min_size=3
>>>
>>>
>>> Production operations questions:
>>> I manually stopped some OSDs to simulate a failure.
>>>
>>> As far as I understood, an "OSD down" condition is not enough to make
>>> CEPH start making new copies of objects. I noticed that I must mark
>>> the OSD as "out" to make ceph produce new copies.
>>> As far as I understood min_size=3 puts the object in readonly if there
>>> are not at least 3 copies of the object available.
>>
>> That is correct, but the default with size 3 is 2 and you probably
>> want to do that instead. If you have size==min_size on firefly
>> releases and lose an OSD it can't do recovery so that PG is stuck
>> without manual intervention. :( This is because of some quirks about
>> how the OSD peering and recovery works, so you'd be forgiven for
>> thinking it would recover nicely.
>> (This is changed in the upcoming Hammer release, but you probably
>> still want to allow cluster activity when an OSD fails, unless you're
>> very confident in their uptime and more concerned about durability
>> than availability.)
>> -Greg
>>
>>>
>>> Is this behavior correct or I made some mistake creating the cluster ?
>>> Should I expect ceph to produce automatically a new copy for objects
>>> when some OSDs are down ?
>>> There is any option to mark automatically "out" OSDs that go "down" ?
>>>
>>> thanks
>>>
>>> Saverio
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] More writes on filestore than on journal ?

2015-03-23 Thread Gregory Farnum
On Mon, Mar 23, 2015 at 6:21 AM, Olivier Bonvalet  wrote:
> Hi,
>
> I'm still trying to find why there is much more write operations on
> filestore since Emperor/Firefly than from Dumpling.

Do you have any history around this? It doesn't sound familiar,
although I bet it's because of the WBThrottle and flushing changes.

>
> So, I add monitoring of all perf counters values from OSD.
>
> From what I see : «filestore.ops» reports an average of 78 operations
> per seconds. But, block device monitoring reports an average of 113
> operations per seconds (+45%).
> please thoses 2 graphs :
> - https://daevel.fr/img/firefly/osd-70.filestore-ops.png
> - https://daevel.fr/img/firefly/osd-70.sda-ops.png

That's unfortunate but perhaps not surprising — any filestore op can
change a backing file (which requires hitting both the file and the
inode: potentially two disk seeks), as well as adding entries to the
leveldb instance.
-Greg

>
> Do you see what can explain this difference ? (this OSD use XFS)
>
> Thanks,
> Olivier
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Uneven CPU usage on OSD nodes

2015-03-23 Thread f...@univ-lr.fr

Hi Greg,

the low-/high-CPU behaviour is absolutely persistent while a host is 
UP, no oscillation.
But rebooting a node can make its behaviour switch between low- and high-CPU, as 
seen this morning after checking that the BIOS settings (especially NUMA) 
were the same on 2 hosts.


Hosts are identical, puppetized and dedicated to their OSD-node role.

I don't know if that's a possibility, but there is a third way: the tools 
collect/deliver wrong information and don't show all the CPU cycles involved


Frederic




Gregory Farnum  a écrit le 23/03/15 15:04 :

On Mon, Mar 23, 2015 at 4:31 AM, f...@univ-lr.fr  wrote:
  

Hi Somnath,

Thank you, please find my answers below

Somnath Roy  a écrit le 22/03/15 18:16 :

Hi Frederick,

Need some information here.



1. Just to clarify, you are saying it is happening g in 0.87.1 and not in
Firefly ?

That's a possibility, others running similar hardware (and possibly OS, I
can ask) confirm they dont have such visible comportment on Firefly.
I'd need to install Firefly on our hosts to be sure.
We run on RHEL.



2. Is it happening after some hours of run or just right away ?

It's happening on freshly installed hosts and goes on.



3. Please provide ‘perf top’ output of all the OSD nodes.

Here they are :
http://www.4shared.com/photo/S9tvbNKEce/UnevenLoad3-perf.html
http://www.4shared.com/photo/OHfiAtXKba/UnevenLoad3-top.html

The left-hand 'high-cpu' nodes have tmalloc calls able to explain the cpu
difference. We don't see them on 'low-cpu' nodes :

12,15%  libtcmalloc.so.4.1.2  [.]
tcmalloc::CentralFreeList::FetchFromSpans



Huh. The tcmalloc (memory allocator) workload should be roughly the
same across all nodes, especially if they have equivalent
distributions of PGs and primariness as you describe. Are you sure
this is a persistent CPU imbalance or are they oscillating? Are there
other processes on some of the nodes which could be requiring memory
from the system?

Either you've found a new bug in our memory allocator or something
else is going on in the system to make it behave differently across
your nodes.
-Greg
  
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] More writes on filestore than on journal ?

2015-03-23 Thread Olivier Bonvalet
Hi,

Le lundi 23 mars 2015 à 07:29 -0700, Gregory Farnum a écrit :
> On Mon, Mar 23, 2015 at 6:21 AM, Olivier Bonvalet  wrote:
> > Hi,
> >
> > I'm still trying to find why there is much more write operations on
> > filestore since Emperor/Firefly than from Dumpling.
> 
> Do you have any history around this? It doesn't sound familiar,
> although I bet it's because of the WBThrottle and flushing changes.

I only have history for block device stats and the global stats reported by
«ceph status».
When I upgraded from Dumpling to Firefly (via Emperor), write
operations increased a lot on the OSDs.
I suppose it's because of the WBThrottle too, but I can't find any parameter
able to confirm that.
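For what it's worth, the flushing behaviour is governed by the
filestore_wbthrottle_* options (the XFS variants in this case). They can at
least be dumped per OSD to see what is in effect; a sketch, and the exact
option names may differ between releases:

ceph --admin-daemon /var/run/ceph/ceph-osd.70.asok config show | grep wbthrottle
# e.g. filestore_wbthrottle_enable,
#      filestore_wbthrottle_xfs_bytes_start_flusher / _hard_limit,
#      filestore_wbthrottle_xfs_ios_start_flusher / _hard_limit,
#      filestore_wbthrottle_xfs_inodes_start_flusher / _hard_limit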


> >
> > So, I add monitoring of all perf counters values from OSD.
> >
> > From what I see : «filestore.ops» reports an average of 78 operations
> > per seconds. But, block device monitoring reports an average of 113
> > operations per seconds (+45%).
> > please thoses 2 graphs :
> > - https://daevel.fr/img/firefly/osd-70.filestore-ops.png
> > - https://daevel.fr/img/firefly/osd-70.sda-ops.png
> 
> That's unfortunate but perhaps not surprising — any filestore op can
> change a backing file (which requires hitting both the file and the
> inode: potentially two disk seeks), as well as adding entries to the
> leveldb instance.
> -Greg
> 

Ok thanks, so this part can be «normal».

> >
> > Do you see what can explain this difference ? (this OSD use XFS)
> >
> > Thanks,
> > Olivier
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Calamari Deployment

2015-03-23 Thread JESUS CHAVEZ ARGUELLES

Does anybody know how to successfully install Calamari on rhel7? I have tried the 
vagrant thing without success and it seems like a nightmare; there is some kind of 
issue when you do vagrant up where it seems not to find the VM path...

Regards 

Jesus Chavez
SYSTEMS ENGINEER-C.SALES

jesch...@cisco.com
Phone: +52 55 5267 3146
Mobile: +51 1 5538883255

CCIE - 44433___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph courseware development opportunity

2015-03-23 Thread Golden Ink
We are looking for someone to develop a course on a Ceph implementation in a 
large computer manufacturer hybrid or public cloud. The project would involve 
meeting with internal engineers to discover latest developments and applying 
their style and standards to the courseware. Project timeframe is immediate 
start with a development window over the next month or so. Payment is 
commensurate with experience, both with Ceph and with courseware/training 
development. Please respond to this email for more information. Please include 
a brief description of your qualifications.

Thank you.

i...@golden-ink.com
www.golden-ink.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph's Logo

2015-03-23 Thread Amy Wilson
Hello,

My name is Amy Wilson. I am the Director of eBrand Business (a 3D logo design and professional CGI animation studio).
I've just visited the Ceph website and, I have to say, you have a really nice business over there. My congratulations!

Try our 2D to 3D logo transitions out - your logo would be amazing with an enhancement.
We're proud of our past work - please review some of our clients.

Examples can be found at: www.ebrandbusiness.com

Do you ever think of starting to sell more by telling your company story in a more effective way?

Visitors today don't have the time to read text - today potential clients watch videos!
A good presentation tells your prospect all you want to say in seconds - the customer doesn't need to click or browse through your website.

Take a look at our popular Company Presentations - get one and effectively increase sales!

I would love the opportunity to discuss this more with you.
Are you available today or tomorrow?

Best regards,
// Amy Wilson
Director of eBrand Business
Making you outstanding

If you do not wish to receive further communication from us, please UNSUBSCRIBE.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] pool has data but rados ls empty

2015-03-23 Thread jipeng song
hi :

There is a pool whose data is not empty (according to 'rados df' or rados stats -p),
but I cannot list the objects in that pool (using 'rados ls -p ' or the python API).
How did this happen? The pool was created with a normal command. By the way,
reading and writing using the C code works fine.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mapping users to different rgw pools

2015-03-23 Thread Steffen W Sørensen
My vague understanding is that this is mapped through the zone associated with 
the specific user. So define your desired pools and zones mapping to the pools, 
and assign users to the desired regions+zones and thus to different pools per user.
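Roughly, the mapping lives in the zone's placement targets plus each user's
default placement. A sketch of the workflow (pool and target names are made up,
and the exact JSON layout depends on the radosgw release):

radosgw-admin zone get > zone.json
# add an extra placement target pointing at different pools, e.g.
#   { "key": "ssd-placement",
#     "val": { "index_pool": ".rgw.buckets.index.ssd",
#              "data_pool": ".rgw.buckets.ssd" } }
radosgw-admin zone set < zone.json
radosgw-admin regionmap update

# then point a given user at that target
radosgw-admin metadata get user:johndoe > user.json
#   set "default_placement": "ssd-placement" in user.json
radosgw-admin metadata put user:johndoe < user.json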


> Den 13/03/2015 kl. 07.48 skrev Sreenath BH :
> 
> Hi all,
> 
> Can one Radow gateway support more than one pool for storing objects?
> 
> And as a follow-up question, is there a way to map different users to
> separate rgw pools so that their obejcts get stored in different
> pools?
> 
> thanks,
> Sreenath
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph User Teething Problems

2015-03-23 Thread Lincoln Bryant
Hi David,

I also see only the RBD pool getting created by default in 0.93.

With regards to resizing placement groups, I believe you can use:
ceph osd pool set [pool name] pg_num [value]
ceph osd pool set [pool name] pgp_num [value]

Be forewarned, this will trigger data migration.
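For example (pool name and counts are illustrative; pg_num can only be
increased, and pgp_num should be raised to match afterwards):

ceph osd pool set rbd pg_num 256
ceph osd pool set rbd pgp_num 256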

Cheers,
Lincoln

On Mar 4, 2015, at 2:27 PM, Datatone Lists wrote:

> I have been following ceph for a long time. I have yet to put it into
> service, and I keep coming back as btrfs improves and ceph reaches
> higher version numbers.
> 
> I am now trying ceph 0.93 and kernel 4.0-rc1.
> 
> Q1) Is it still considered that btrfs is not robust enough, and that
> xfs should be used instead? [I am trying with btrfs].
> 
> I followed the manual deployment instructions on the web site 
> (http://ceph.com/docs/master/install/manual-deployment/) and I managed
> to get a monitor and several osds running and apparently working. The
> instructions fizzle out without explaining how to set up mds. I went
> back to mkcephfs and got things set up that way. The mds starts.
> 
> [Please don't mention ceph-deploy]
> 
> The first thing that I noticed is that (whether I set up mon and osds
> by following the manual deployment, or using mkcephfs), the correct
> default pools were not created.
> 
> bash-4.3# ceph osd lspools
> 0 rbd,
> bash-4.3# 
> 
> I get only 'rbd' created automatically. I deleted this pool, and
> re-created data, metadata and rbd manually. When doing this, I had to
> juggle with the pg_num in order to avoid the 'too many pgs for osd'.
> I have three osds running at the moment, but intend to add to these
> when I have some experience of things working reliably. I am puzzled,
> because I seem to have to set the pg-num for the pool to a number that
> makes (N-pools x pg-num)/N-osds come to the right kind of number. So
> this implies that I can't really expand a set of pools by adding osds
> at a later date. 
> 
> Q2) Is there any obvious reason why my default pools are not getting
> created automatically as expected?
> 
> Q3) Can pg-num be modified for a pool later? (If the number of osds is 
> increased dramatically).
> 
> Finally, when I try to mount cephfs, I get a mount 5 error.
> 
> "A mount 5 error typically occurs if a MDS server is laggy or if it
> crashed. Ensure at least one MDS is up and running, and the cluster is
> active + healthy".
> 
> My mds is running, but its log is not terribly active:
> 
> 2015-03-04 17:47:43.177349 7f42da2c47c0  0 ceph version 0.93 
> (bebf8e9a830d998eeaab55f86bb256d4360dd3c4), process ceph-mds, pid 4110
> 2015-03-04 17:47:43.182716 7f42da2c47c0 -1 mds.-1.0 log_to_monitors 
> {default=true}
> 
> (This is all there is in the log).
> 
> I think that a key indicator of the problem must be this from the
> monitor log:
> 
> 2015-03-04 16:53:20.715132 7f3cd0014700  1
> mon.ceph-mon-00@0(leader).mds e1 warning, MDS mds.?
> [2001:8b0::5fb3::1fff::9054]:6800/4036 up but filesystem
> disabled
> 
> (I have added the '' sections to obscure my ip address)
> 
> Q4) Can you give me an idea of what is wrong that causes the mds to not
> play properly?
> 
> I think that there are some typos on the manual deployment pages, for
> example:
> 
> ceph-osd id={osd-num}
> 
> This is not right. As far as I am aware it should be:
> 
> ceph-osd -i {osd-num}
> 
> An observation. In principle, setting things up manually is not all
> that complicated, provided that clear and unambiguous instructions are
> provided. This simple piece of documentation is very important. My view
> is that the existing manual deployment instructions gets a bit confused
> and confusing when it gets to the osd setup, and the mds setup is
> completely absent.
> 
> For someone who knows, this would be a fairly simple and fairly quick 
> operation to review and revise this part of the documentation. I
> suspect that this part suffers from being really obvious stuff to the
> well initiated. For those of us closer to the start, this forms the
> ends of the threads that have to be picked up before the journey can be
> made.
> 
> Very best regards,
> David
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Write IO Problem

2015-03-23 Thread Rottmann Jonas
Hi,

We have a huge write IO problem in our pre-production Ceph cluster. First, our 
hardware:

4 OSD Nodes with:

Supermicro X10 Board
32GB DDR4 RAM
2x Intel Xeon E5-2620
LSI SAS 9300-8i Host Bus Adapter
Intel Corporation 82599EB 10-Gigabit
2x Intel SSDSA2CT040G3 in software raid 1 for system

Disks:
2x Samsung EVO 840 1TB

So 8 SSDs in total as OSDs, formatted with btrfs (via ceph-disk, only nodiratime 
added)

Benchmarking one disk alone gives good values:

dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
1073741824 bytes (1.1 GB) copied, 2.53986 s, 423 MB/s

Fio 8k libaio depth=32:
write: io=488184KB, bw=52782KB/s, iops=5068 , runt=  9249msec

Here our ceph.conf (pretty much standard):

[global]
fsid = 89191a54-740a-46c7-a325-0899ab32fd1d
mon initial members = cephasp41,ceph-monitor41
mon host = 172.30.10.15,172.30.10.19
public network = 172.30.10.0/24
cluster network = 172.30.10.0/24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx

#Default is 1GB, which is fine for us
#osd journal size = {n}

#Only needed if ext4 comes to play
#filestore xattr use omap = true

osd pool default size = 3  # Write an object n times.
osd pool default min size = 2 # Allow writing n copy in a degraded state.

#Set individual per pool by a formula
#osd pool default pg num = {n}
#osd pool default pgp num = {n}
#osd crush chooseleaf type = {n}


When I benchmark the cluster with "rbd bench-write rbd/fio" I get pretty good 
results:
elapsed:18  ops:   262144  ops/sec: 14466.30  bytes/sec: 59253946.11

If I bench with fio using the rbd engine, for example, I get very poor results:

[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio
invalidate=0# mandatory
rw=randwrite
bs=512k

[rbd_iodepth32]
iodepth=32

RESULTS:
write: io=2048.0MB, bw=53896KB/s, iops=105, runt= 38911msec

Also, if I map the rbd with the kernel client as rbd0, format it with ext4 and then do a 
dd on it, it's not that good:
"dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc"
RESULT:
1073741824 bytes (1.1 GB) copied, 12.6152 s, 85.1 MB/s

I also tried presenting an rbd image with tgtd, mounting it on VMware ESXi and 
testing it in a VM; there I got only around 50 IOPS with 4k, and sequential 
writes of about 25 MB/s.
With NFS the sequential read values are good (400 MB/s) but writes are only 
25 MB/s.

What I tried tweaking so far:

Intel NIC optimizations:
/etc/sysctl.conf

# Increase system file descriptor limit
fs.file-max = 65535

# Increase system IP port range to allow for more concurrent connections
net.ipv4.ip_local_port_range = 1024 65000

# -- 10gbe tuning from Intel ixgb driver README -- #

# turn off selective ACK and timestamps
net.ipv4.tcp_sack = 0
net.ipv4.tcp_timestamps = 0

# memory allocation min/pressure/max.
# read buffer, write buffer, and buffer space
net.ipv4.tcp_rmem = 1000 1000 1000
net.ipv4.tcp_wmem = 1000 1000 1000
net.ipv4.tcp_mem = 1000 1000 1000

net.core.rmem_max = 524287
net.core.wmem_max = 524287
net.core.rmem_default = 524287
net.core.wmem_default = 524287
net.core.optmem_max = 524287
net.core.netdev_max_backlog = 30

AND

setpci -v -d 8086:10fb e6.b=2e


Setting tunables to firefly:
ceph osd crush tunables firefly

Setting scheduler to noop:
This basically stopped IO on the cluster, and I had to revert it 
and restart some of the osds with requests stuck

And I tried moving the monitor from an VM to the Hardware where the OSDs run.


Any suggestions where to look, or what could cause that problem?
(because I can't believe you're losing that much performance through ceph 
replication)

Thanks in advance.

If you need any info please tell me.

Mit freundlichen Grüßen/Kind regards
Jonas Rottmann
Systems Engineer

FIS-ASP Application Service Providing und
IT-Outsourcing GmbH
Röthleiner Weg 4
D-97506 Grafenrheinfeld
Phone: +49 (9723) 9188-568
Fax: +49 (9723) 9188-600

email: j.rottm...@fis-asp.de   web: www.fis-asp.de

Geschäftsführer Robert Schuhmann
Registergericht Schweinfurt HRB 3865
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Multiple OSD's in a Each node with replica 2

2015-03-23 Thread Azad Aliyar
I have a doubt. In a scenario of 3 nodes x 4 OSDs each x 2 replicas, I tested
with a node down, and as long as you have space available all objects were
there.

Is it possible for all replicas of an object to be saved on the same node?

Is it possible to lose any?

Is there a mechanism that prevents replicas from being stored on another OSD in
the same node?

I would love someone to answer it and any information is highly appreciated.
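For context, the mechanism in question is the failure-domain step of the CRUSH
rule. A default replicated ruleset looks roughly like this (a sketch, not
necessarily this cluster's actual rule):

rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        # "type host" is what keeps the copies of a PG on different nodes;
        # with "type osd" two replicas could land on the same node
        step chooseleaf firstn 0 type host
        step emit
}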
-- 
   Warm Regards,  Azad Aliyar
 Linux Server Engineer
 *Email* :  azad.ali...@sparksupport.com   *|*   *Skype* :   spark.azad
 

  
3rd Floor, Leela Infopark, Phase -2,Kakanad, Kochi-30, Kerala, India
*Phone*:+91 484 6561696 , *Mobile*:91-8129270421.   *Confidentiality
Notice:* Information in this e-mail is proprietary to SparkSupport. and is
intended for use only by the addressed, and may contain information that is
privileged, confidential or exempt from disclosure. If you are not the
intended recipient, you are notified that any use of this information in
any manner is strictly prohibited. Please delete this mail & notify us
immediately at i...@sparksupport.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deploy ceph

2015-03-23 Thread kefu chai
hi harry,

your question is more related to the ceph-user, so i am replying to ceph-users.

On Wed, Mar 18, 2015 at 12:02 AM, harryxiyou  wrote:
> Hi all,
>
> I wanna deploy Ceph and I see the doc here
> (http://docs.ceph.com/docs/dumpling/start/quick-start-preflight/). I
> wonder how I could install ceph from the latest source code instead of
> from packaged software like `sudo apt-get install ceph-deploy`?
> After I compile the ceph source code I would get ceph-deploy, right? No
> need for other tools or dependencies?

ceph-deploy is living in ceph/ceph-deploy repo[0]. it's basically
written in python. most ceph binaries and helper scripts are living in
ceph/ceph[1]. you might want to read
https://github.com/ceph/ceph/blob/master/README.md for more details on
how to build it and its build-time and run-time dependencies.
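at the time of writing the tree builds with autotools, so building from source
looks roughly like this (a sketch only; the README above is the authoritative
reference, and the build dependencies it lists need to be installed first):

git clone --recursive https://github.com/ceph/ceph.git
cd ceph
./autogen.sh
./configure
make -j4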

HTH.

---
[0] https://github.com/ceph/ceph-deploy
[1] https://github.com/ceph/ceph
>
>
> Thanks, Harry
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Regards
Kefu Chai
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] The project of ceph client file system porting from Linux to AIX

2015-03-23 Thread Ketor D
Hi Dennis,
  I am interested in your project.
  I wrote a Win32 cephfs client https://github.com/ceph/ceph-dokan.
  But ceph-dokan runs in user mode. I see you ported code from the
kernel cephfs; are you planning to write a kernel-mode AIX cephfs?

Thanks!


2015-03-04 17:59 GMT+08:00 Dennis Chen :
> Hello,
>
> The ceph cluster can now only be used by Linux systems AFAICT, so I
> planned to port the ceph client file system from Linux to AIX as a
> tiered storage solution on that platform. Below is the source code
> repository I've done, which is still in progress. 3 important modules:
>
> 1. aixker: maintain a uniform kernel API between Linux and AIX
> 2. net: as a data transfering layer between the client and cluster
> 3. fs: as an adaptor to make the AIX can recognize the Linux file system.
>
> https://github.com/Dennis-Chen1977/aix-cephfs
>
> Welcome any comments or anything...
>
> --
> Den
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Hammer OSD Shard Tuning Test Results

2015-03-23 Thread Vu Pham

>>>  This would be a good thing to bring up in the meeting on Wednesday.
>yes !
>

Yes, we can discuss details on Wed's call.


>
>>>I wonder how much effect flow-control and header/data crc had.
>yes. I known that sommath also disable crc for his bench
>

I disabled ceph's header/data crc for both simplemessenger & xio but 
didn't run with header/data crc enabled to see the differences.


>
>>>Were the simplemessenger tests on IPoIB or native?
>
>I think it's native, as the Vu Pham benchmark was done on mellanox 
>sx1012 (ethernet).
>And xio messenger was on Roce (rdma over ethernet)
>
Yes, it's native for simplemessenger and RoCE for xio messenger


>
>>>How big was the RBD volume that was created (could some data be
>>>locally cached)? Did network data transfer statistics match the
>>>benchmark result numbers?
>
Single OSD on 4GB ramdisk, journal size is 256MB.

RBD volume is only 128MB; however, I ran fio_rbd client with direct=1 to 
bypass local buffer cache
Yes, the network data xfer statistics match the benchmark result 
numbers.
I used "dstat -N " to monitor the network data statistics

I also turned all cores @ full speed and applied one parameter tuning 
for Mellanox ConnectX-3 HCA mlx4_core driver
(options mlx4_core  log_num_mgm_entry_size=-7)

$ cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
2601000

$ for c in ./cpu[0-55]*; do echo 2601000 > 
${c}/cpufreq/scaling_min_freq; done



>
>
>
>I @cc Vu pham to this mail maybe it'll be able to give us answer.
>
>
>Note that I'll have same mellanox switches (sx1012) for my production 
>cluster in some weeks,
>so I'll be able to reproduce the bench. (with 2x10 cores 3,1ghz nodes 
>and clients).
>
>
>
>
>
>- Mail original -
>De: "Mark Nelson" 
>À: "aderumier" 
>Cc: "ceph-devel" , "ceph-users" 
>
>Envoyé: Lundi 2 Mars 2015 15:39:24
>Objet: Re: [ceph-users] Ceph Hammer OSD Shard Tuning Test Results
>
>Hi Alex,
>
>I see I even responded in the same thread! This would be a good thing
>to bring up in the meeting on Wednesday. Those are far faster single
>OSD results than I've been able to muster with simplemessenger. I
>wonder how much effect flow-control and header/data crc had. He did
>have quite a bit more CPU (Intel specs say 14 cores @ 2.6GHz, 28 if you
>count hyperthreading). Depending on whether there were 1 or 2 CPUs in
>that node, that might be around 3x the CPU power I have here.
>
>Some other thoughts: Were the simplemessenger tests on IPoIB or native?
>How big was the RBD volume that was created (could some data be
>locally cached)? Did network data transfer statistics match the
>benchmark result numbers?
>
>I also did some tests on fdcache, though just glancing at the results 
>it
>doesn't look like tweaking those parameters had much effect.
>
>Mark
>
>On 03/01/2015 08:38 AM, Alexandre DERUMIER wrote:
>>  Hi Mark,
>>
>>  I found an previous bench from Vu Pham (it's was about 
>>simplemessenger vs xiomessenger)
>>
>>  http://www.spinics.net/lists/ceph-devel/msg22414.html
>>
>>  and with 1 osd, he was able to reach 105k iops with simple messenger
>>
>>  . ~105k iops (4K random read, 20 cores used, numjobs=8, iopdepth=32)
>>
>>  this was with more powerfull nodes, but the difference seem to be 
>>quite huge
>>
>>
>>
>>  - Mail original -
>>  De: "aderumier" 
>>  À: "Mark Nelson" 
>>  Cc: "ceph-devel" , "ceph-users" 
>>
>>  Envoyé: Vendredi 27 Février 2015 07:10:42
>>  Objet: Re: [ceph-users] Ceph Hammer OSD Shard Tuning Test Results
>>
>>  Thanks Mark for the results,
>>  default values seem to be quite resonable indeed.
>>
>>
>>  I also wonder is cpu frequency can have an impact on latency or not.
>>  I'm going to benchmark on dual xeon 10-cores 3,1ghz nodes in coming 
>>weeks,
>>  I'll try replay your benchmark to compare
>>
>>
>>
>>  - Mail original -
>>  De: "Mark Nelson" 
>>  À: "ceph-devel" , "ceph-users" 
>>
>>  Envoyé: Jeudi 26 Février 2015 05:44:15
>>  Objet: [ceph-users] Ceph Hammer OSD Shard Tuning Test Results
>>
>>  Hi Everyone,
>>
>>  In the Ceph Dumpling/Firefly/Hammer SSD/Memstore performance 
>>comparison
>>  thread, Alexandre DERUMIER wondered if changing the default shard and
>>  threads per shard OSD settings might have a positive effect on
>>  performance in our tests. I went back and used one of the PCIe SSDs
>>  from our previous tests to experiment with a recent master pull. I
>>  wanted to know how performance was affected by changing these 
>>parameters
>>  and also to validate that the default settings still appear to be 
>>correct.
>>
>>  I plan to conduct more tests (potentially across multiple SATA SSDs 
>>in
>>  the same box), but these initial results seem to show that the 
>>default
>>  settings that were chosen are quite reasonable.
>>
>>  Mark
>>
>>  ___
>>  ceph-users mailing list
>>  ceph-users@lists.ceph.com
>>  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>  ___
>>  ceph-user

Re: [ceph-users] who is using radosgw with civetweb?

2015-03-23 Thread Axel Dunkel
Sage,

we use apache as a filter for security and additional functionality
reasons. I do like the idea, but we'd need some kind of interface to
filter/modify/process requests.

Best regards
Axel Dunkel

-Ursprüngliche Nachricht-
Von: ceph-devel-ow...@vger.kernel.org
[mailto:ceph-devel-ow...@vger.kernel.org] Im Auftrag von Sage Weil
Gesendet: Mittwoch, 25. Februar 2015 20:32
An: ceph-us...@ceph.com; ceph-de...@vger.kernel.org
Betreff: who is using radosgw with civetweb?

Hey,

We are considering switching to civetweb (the embedded/standalone rgw web
server) as the primary supported RGW frontend instead of the current
apache + mod-fastcgi or mod-proxy-fcgi approach.  "Supported" here means
both the primary platform the upstream development focuses on and what the
downstream Red Hat product will officially support.

How many people are running RGW standalone with the embedded civetweb
server instead of apache?  In production?  At what scale?  What
version(s)?  (civetweb first appeared in firefly and we've backported most
fixes.)

Have you seen any problems?  Any other feedback?  The hope is to (vastly)
simplify deployment.
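
For anyone who hasn't tried it yet, switching an existing gateway over is
essentially a one-line change in ceph.conf, roughly like the following (the
section name and port are only examples, adjust for your own gateway instance):

    [client.radosgw.gateway]
        # serve requests directly from the embedded civetweb server
        rgw frontends = "civetweb port=7480"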

Thanks!
sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org More majordomo info at
http://vger.kernel.org/majordomo-info.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster on docker containers

2015-03-23 Thread Pavel V. Kaygorodov
Hi!

I'm using ceph cluster, packed to a number of docker containers.
There are two things, which you need to know:

1. Ceph OSDs use FS attributes, which may not be supported by the filesystem 
inside a docker container, so you need to mount an external directory into the 
container to store the OSD data (see the sketch below).
2. Ceph monitors must have static external IPs, so you have to use lxc-conf 
directives to set static IPs inside the containers.
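
For illustration only, the OSD side of point 1 boils down to something like this 
(the image name and host paths are just placeholders, not a recipe):

    # keep the OSD data on a host directory, where the attributes Ceph needs
    # are fully supported, and share the cluster config with the container
    docker run -d --name ceph-osd-0 \
        -v /srv/ceph/osd-0:/var/lib/ceph/osd/ceph-0 \
        -v /etc/ceph:/etc/ceph \
        my-ceph-osd-image

For the monitors, either pin the address with lxc-conf directives as described 
above, or simply run the container with --net=host so it keeps the host's 
static IP.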


With best regards,
  Pavel.


> On 6 March 2015 at 10:15, Sumit Gaur  wrote:
> 
> Hi
> I need to know if Ceph has any Docker story. What I am not able to find is 
> whether there are any predefined steps for a ceph cluster to be deployed on 
> Docker containers.
> 
> Thanks
> sumit
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] arm cluster install

2015-03-23 Thread hp cre
Yann, thanks for the info. It's been a great help.
On 23 Mar 2015 14:44, "Yann Dupont - Veille Techno" <
veilletechno-i...@univ-nantes.fr> wrote:

> On 22/03/2015 22:44, hp cre wrote:
>
>>
>> Hello Yann,
>>
>> Thanks for your reply. Unfortunately, I only found it by chance during a
>> search; since you didn't include me in the reply, I never got it in my
>> email.
>>
>>
> Well, that wasn't intended, but that's because I replied to the list, which
> is the way I usually do it.
>
>  I am interested in what you mentioned so far. I'm not looking into making
>> any production grade cluster,  just a couple of nodes for testing ceph and
>> its failure scenarios.
>>
>> Current ubuntu and Debian based distributions for Banana pro are based on
>> kernel 3.4.103. I see you used a more recent kernel; did you get it
>> ready-made or did you compile it yourself?
>>
>>
> I compiled it myself. Since the 3.18 kernel, the upstreaming efforts of the sunxi
> community have been paying off (thanks to them), and we now have all the pieces
> for complete vanilla kernel support, at least for all server-relevant
> components.
>
>  I actually have the choice now of attaching the osd disks via either sata
>> or usb. I'm buying those Chinese 16GB ssd disks. They are good for i/o but
>> not for write speed.
>>
>>
> Well, as I said, the SATA port on the A20 is currently limited regarding write
> speed anyway. So depending on your workload, you may not really find a big
> difference.
>
> Cheers,
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Uneven CPU usage on OSD nodes

2015-03-23 Thread Somnath Roy
Yes, we are also facing a similar issue under load (and after running for some 
time). This is a tcmalloc behavior.
You can try setting the following env variable to a bigger value, say 128MB or 
so.

TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES

This env variable is supposed to alleviate the issue, but we found that in the 
Ubuntu 14.04 version of tcmalloc this env variable is a no-op. This was a bug in 
tcmalloc which has been fixed in the latest tcmalloc code base.
Not sure about RHEL though. In that case, you may want to try the latest 
tcmalloc. Just pointing LD_LIBRARY_PATH at the new tcmalloc location should 
work fine.
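
For example, something along these lines in the environment that the OSD 
processes are started from is what I mean (the 128MB value and the library 
path are only examples):

    # give tcmalloc a larger total thread cache (128MB)
    export TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728
    # pick up a newer tcmalloc build instead of the distro one
    export LD_LIBRARY_PATH=/opt/gperftools/lib:$LD_LIBRARY_PATH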

The latest Ceph master has support for jemalloc, and you may want to try that 
if this is your test cluster.

Another point: I saw that the node consuming more CPU has more memory pressure as 
well (and that’s why tcmalloc is also having that issue). Can you give us the 
output of ‘ceph osd tree’ to check if the load distribution is even? Also, check 
whether those systems are swapping or not.

Hope this helps.


Thanks & Regards
Somnath

From: f...@univ-lr.fr [mailto:f...@univ-lr.fr]
Sent: Monday, March 23, 2015 4:31 AM
To: Somnath Roy
Cc: Ceph Users
Subject: Re: [ceph-users] Uneven CPU usage on OSD nodes

Hi Somnath,

Thank you, please find my answers below

Somnath Roy wrote on 22/03/15 18:16:
Hi Frederick,
Need some information here.

1. Just to clarify, you are saying it is happening in 0.87.1 and not in 
Firefly?
That's a possibility; others running similar hardware (and possibly the same OS, 
I can ask) confirm they don't see such visible behaviour on Firefly.
I'd need to install Firefly on our hosts to be sure.
We run on RHEL.


2. Is it happening after some hours of running, or right away?
It's happening on freshly installed hosts and it keeps going.


3. Please provide ‘perf top’ output of all the OSD nodes.
Here they are :
http://www.4shared.com/photo/S9tvbNKEce/UnevenLoad3-perf.html
http://www.4shared.com/photo/OHfiAtXKba/UnevenLoad3-top.html

The left-hand 'high-cpu' nodes show tcmalloc calls that can explain the cpu 
difference. We don't see them on the 'low-cpu' nodes:

12,15%  libtcmalloc.so.4.1.2  [.] tcmalloc::CentralFreeList::FetchFromSpans


4. Provide the ceph.conf file from your OSD node as well.
It's a basic configuration. FSID and IP are removed

[global]
fsid = 589xa9
mon_initial_members = helga
mon_host = X.Y.Z.64
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = X.Y.0.0/16


Regards,
Frederic



Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
f...@univ-lr.fr
Sent: Sunday, March 22, 2015 2:15 AM
To: Craig Lewis
Cc: Ceph Users
Subject: Re: [ceph-users] Uneven CPU usage on OSD nodes

Hi Craig,

An uneven primaries distribution was indeed my first thought.
I should have been more explicit about the percentages in the histograms I gave; 
let's look at them in detail in a more comprehensive way.

On 27938 bench objects seen by the osdmap, the hosts are distributed like this:
20904 host1
21210 host2
20835 host3
20709 host3
That's the number of times they appear (as primary, secondary or tertiary).
The distribution is pretty even, as we don't have more than a 0.5% difference in 
total objects between the most and the least used host.

If we now consider the primary host distribution, here is what we have:
7207 host1
6960 host2
6814 host3
6957 host3
That's the number of times each host appears as primary.
Once again, the distribution is correct, with less than 1.5% of the total 
entries separating the most and the least used host as primary.
I must add that such a distribution is of course observed for the secondary and 
the tertiary copy.

I think we have enough samples to confirm the correct distribution of the 
crush function.
With each host having a 25% chance of being primary, this shouldn't be the reason 
why we observe a higher CPU load. There must be something else.

I must add we run 0.87.1 Giant.
Going to a Firefly release is an option, as the phenomenon is not currently 
observed on comparable hardware platforms running 0.80.x.
About the memory on the hosts, 32GB is just a starting point for the tests. We'll 
add more later.

Frederic


Craig Lewis wrote on 20/03/15 23:19:
I would say you're a little light on RAM.  With 4TB disks 70% full, I've seen 
some ceph-osd processes using 3.5GB of RAM during recovery.  You'll be fine 
during normal operation, but you might run into issues at the worst possible 
time.

I have 8 OSDs per node, and 32G of RAM.  I've had ceph-osd processes start 
swapping, and that's a great way to get them kicked out for being unresponsive.


I'm not a dev, but I can make some wild and uninformed guesses :-) .  The 
primary OSD uses more CPU than the replicas, and I suspect that you have more 
primaries on the hot nodes.

Since you're testing, try repeating 

Re: [ceph-users] Multiple OSDs in each node with replica 2

2015-03-23 Thread Robert LeBlanc
I don't have a fresh cluster on hand to double check, but the default is to
select a different host for each replica. You can adjust that to fit your
needs; we are using cabinet as the selection criterion so that we can lose
an entire cabinet of storage and still function.

In order to store multiple replicas on the same node, you will need to
change this from host to osd (see the rule excerpt below).

Please see http://ceph.com/docs/master/rados/operations/crush-map/
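
Concretely, the line that controls this in the decompiled CRUSH rule looks like
one of the following (just a one-line excerpt of a rule, not a complete map):

    # default: each replica goes to a different host
    step chooseleaf firstn 0 type host
    # allows replicas to land on different OSDs of the same host
    step chooseleaf firstn 0 type osd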

On Tue, Mar 3, 2015 at 7:39 PM, Azad Aliyar 
wrote:

>
> I  have a doubt . In a scenario (3nodes x 4osd each x 2replica)  I tested
> with a node down and as long as you have space available all objects were
> there.
>
>
>
> Is it possible to lose any?
>
> Is there a mechanism that prevents replicas from being stored on another osd in
> the same node?
>
> I would love someone to answer it and any information is highly
> appreciated.
> --
>Warm Regards,  Azad Aliyar
>  Linux Server Engineer
>  *Email* :  azad.ali...@sparksupport.com   *|*   *Skype* :   spark.azad
>  
> 
> 
> 3rd Floor, Leela Infopark, Phase
> -2,Kakanad, Kochi-30, Kerala, India  *Phone*:+91 484 6561696 , 
> *Mobile*:91-8129270421.
>   *Confidentiality Notice:* Information in this e-mail is proprietary to
> SparkSupport. and is intended for use only by the addressed, and may
> contain information that is privileged, confidential or exempt from
> disclosure. If you are not the intended recipient, you are notified that
> any use of this information in any manner is strictly prohibited. Please
> delete this mail & notify us immediately at i...@sparksupport.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RADOS Gateway Maturity

2015-03-23 Thread Jerry Lam
Hi Chris and Craig,

Thank you for sharing your experience with me about S3 API RADOS gateway!

Jerry Lam
Senior Software Developer, Big Data

Ontario Institute for Cancer Research
MaRS Centre
661 University Avenue
Suite 510
Toronto, Ontario
Canada M5G 0A3

Email:   jerry@oicr.on.ca
Toll-free: 1-866-678-6427
Twitter: @OICR_news

www.oicr.on.ca

This message and any attachments may contain confidential and/or privileged 
information for the sole use of the intended recipient. Any review or 
distribution by anyone other than the person for whom it was originally 
intended is strictly prohibited. If you have received this message in error, 
please contact the sender and delete all copies. Opinions, conclusions or other 
information contained in this message may not be that of the organization.

From: Chris Jones mailto:cjo...@cloudm2.com>>
Date: Friday, March 20, 2015 at 7:28 PM
To: Craig Lewis mailto:cle...@centraldesktop.com>>
Cc: Jerry Lam mailto:jerry@oicr.on.ca>>, 
"ceph-users@lists.ceph.com" 
mailto:ceph-users@lists.ceph.com>>
Subject: Re: [ceph-users] RADOS Gateway Maturity

Hi Jerry,

We are using RGW and RBD in our OpenStack clusters and as stand-alone clusters. We 
have six large clusters and are adding more. Most of the issues we have faced have 
been self-inflicted, such as not currently supporting bucket names that look like 
host names. Some S3 tools only work that way, which causes some of our developer 
customers grief. We are addressing that. We have built extensive testing 
frameworks around S3 RGW testing, using OpenStack, AWS EC2 or Google Cloud 
Platform to dynamically spin up worker nodes to distribute load for stress and 
performance testing.

I'm actually building a project called IQStack that will be released on github 
soon that does that plus other OpenStack testing for scalability. Anyway, there 
may be some incompatibilities depending on the feature set but most can be 
abstracted away and addressed.

Just as a footnote, I finished running long load tests against AWS S3 and 
RGW. After tuning our load balancers, firewall rules and a few other tweaks I 
was able to get parity with AWS S3 up to 10GbE (the maximum size of the load 
balancer I was testing with). We use several CDNs on video clips, so my tests 
were with 2MB byte-range requests.

Thanks,
Chris

On Fri, Mar 20, 2015 at 6:06 PM, Craig Lewis 
mailto:cle...@centraldesktop.com>> wrote:
I have found a few incompatibilities, but so far they're all on the Ceph side.  
One example I remember was having to change the way we delete objects.  The 
function we originally used fetches a list of object versions, and deletes all 
versions.  Ceph is implementing object versions now (I believe that'll ship 
with Hammer), so we had to call a different function to delete the object 
without iterating over the versions.

AFAIK, that code should work fine if we point it at Amazon.  I haven't tried it 
though.


I've been using RGW (with replication) in production for 2 years now, although 
I'm not large.  So far, all of my RGW issues have been Ceph issues.  Most of my 
issues are caused by my under-powered hardware, or shooting myself in the foot 
with aggressive optimizations.  Things are better with my journals on SSD, but 
the best thing I did was slow down with my changes.  For example, I have 7 OSD 
nodes and 72 OSDs.  When I add new OSDs, I add a couple at a time instead of 
adding all the disks in a node at once.  Guess how I learned that lesson. :-)
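
One way to do that gently is to bring each new OSD in at a low CRUSH weight and 
raise it in steps once each round of backfill settles, for example (the OSD id 
and weights are only illustrative):

    ceph osd crush reweight osd.72 0.5
    # wait for the cluster to return to HEALTH_OK, then
    ceph osd crush reweight osd.72 1.0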



On Wed, Mar 18, 2015 at 10:03 AM, Jerry Lam 
mailto:jerry@oicr.on.ca>> wrote:
Hi Chris,

Thank you for your reply.
We are also thinking about using the S3 API, but we are concerned about how 
compatible it is with the real S3. For instance, we would like to design the 
system using pre-signed URLs for storing some objects. I read the ceph 
documentation; it does not mention whether this is supported or not.

My question is: do you find that code written against the RADOS S3 API can easily 
run against Amazon S3 without any change? If not, how much effort is needed to 
make it compatible?

Best Regards,

Jerry
From: Chris Jones mailto:cjo...@cloudm2.com>>
Date: Tuesday, March 17, 2015 at 4:39 PM
To: Jerry Lam mailto:jerry@oicr.on.ca>>
Cc: "ceph-users@lists.ceph.com" 
mailto:ceph-users@lists.ceph.com>>
Subject: Re: [ceph-users] RADOS Gateway Maturity

Hi Jerry,

I work at Bloomberg, and we currently have a very large Ceph 
installation in production; we use the S3-compatible API for rados gateway. 
We are also re-architecting our new RGW and evaluating a different Apache 
configuration for a little better performance. We only use replicas right now, 
no erasure coding yet. Actually, you can take a look at our current 
configuration at https://github.com/bloomberg/chef-bcpc.

-Chris

On Tue, Mar 17, 2015 at 10:40 AM, Jerry Lam 
mailto:jerry@oicr.on.ca>> wrote:
Hi Ceph user,

I’m new to Ceph but I ne

[ceph-users] CRUSH decompile fails

2015-03-23 Thread Robert LeBlanc
I was trying to decompile and edit the CRUSH map to adjust the CRUSH
rules. My first attempt created a map that would decompile, but I
could not recompile the CRUSH map even if I didn't modify it. When trying to
download the CRUSH map fresh, now the decompile fails.

[root@nodezz ~]# ceph osd getmap -o map.crush
got osdmap epoch 12792
[root@nodezz ~]# crushtool -d map.crush -o map
terminate called after throwing an instance of 'ceph::buffer::malformed_input'
  what():  buffer::malformed_input: bad magic number
*** Caught signal (Aborted) **
 in thread 7f889ed24780
 ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4)
 1: crushtool() [0x4f4542]
 2: (()+0xf130) [0x7f889df97130]
 3: (gsignal()+0x39) [0x7f889cfd05c9]
 4: (abort()+0x148) [0x7f889cfd1cd8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f889d8d49d5]
 6: (()+0x5e946) [0x7f889d8d2946]
 7: (()+0x5e973) [0x7f889d8d2973]
 8: (()+0x5eb9f) [0x7f889d8d2b9f]
 9: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0x5b8) [0x523fa8]
 10: (main()+0x1e0e) [0x4ead4e]
 11: (__libc_start_main()+0xf5) [0x7f889cfbcaf5]
 12: crushtool() [0x4ee5a9]
2015-03-23 12:46:34.637635 7f889ed24780 -1 *** Caught signal (Aborted) **
 in thread 7f889ed24780

 ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4)
 1: crushtool() [0x4f4542]
 2: (()+0xf130) [0x7f889df97130]
 3: (gsignal()+0x39) [0x7f889cfd05c9]
 4: (abort()+0x148) [0x7f889cfd1cd8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f889d8d49d5]
 6: (()+0x5e946) [0x7f889d8d2946]
 7: (()+0x5e973) [0x7f889d8d2973]
 8: (()+0x5eb9f) [0x7f889d8d2b9f]
 9: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0x5b8) [0x523fa8]
 10: (main()+0x1e0e) [0x4ead4e]
 11: (__libc_start_main()+0xf5) [0x7f889cfbcaf5]
 12: crushtool() [0x4ee5a9]
 NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.

--- begin dump of recent events ---
   -14> 2015-03-23 12:46:34.633547 7f889ed24780  5 asok(0x3229cc0)
register_command perfcounters_dump hook 0x322be00
   -13> 2015-03-23 12:46:34.633580 7f889ed24780  5 asok(0x3229cc0)
register_command 1 hook 0x322be00
   -12> 2015-03-23 12:46:34.633587 7f889ed24780  5 asok(0x3229cc0)
register_command perf dump hook 0x322be00
   -11> 2015-03-23 12:46:34.633596 7f889ed24780  5 asok(0x3229cc0)
register_command perfcounters_schema hook 0x322be00
   -10> 2015-03-23 12:46:34.633604 7f889ed24780  5 asok(0x3229cc0)
register_command 2 hook 0x322be00
-9> 2015-03-23 12:46:34.633609 7f889ed24780  5 asok(0x3229cc0)
register_command perf schema hook 0x322be00
-8> 2015-03-23 12:46:34.633615 7f889ed24780  5 asok(0x3229cc0)
register_command perf reset hook 0x322be00
-7> 2015-03-23 12:46:34.633639 7f889ed24780  5 asok(0x3229cc0)
register_command config show hook 0x322be00
-6> 2015-03-23 12:46:34.633654 7f889ed24780  5 asok(0x3229cc0)
register_command config set hook 0x322be00
-5> 2015-03-23 12:46:34.633661 7f889ed24780  5 asok(0x3229cc0)
register_command config get hook 0x322be00
-4> 2015-03-23 12:46:34.633672 7f889ed24780  5 asok(0x3229cc0)
register_command config diff hook 0x322be00
-3> 2015-03-23 12:46:34.633685 7f889ed24780  5 asok(0x3229cc0)
register_command log flush hook 0x322be00
-2> 2015-03-23 12:46:34.633698 7f889ed24780  5 asok(0x3229cc0)
register_command log dump hook 0x322be00
-1> 2015-03-23 12:46:34.633711 7f889ed24780  5 asok(0x3229cc0)
register_command log reopen hook 0x322be00
 0> 2015-03-23 12:46:34.637635 7f889ed24780 -1 *** Caught signal
(Aborted) **
 in thread 7f889ed24780

 ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4)
 1: crushtool() [0x4f4542]
 2: (()+0xf130) [0x7f889df97130]
 3: (gsignal()+0x39) [0x7f889cfd05c9]
 4: (abort()+0x148) [0x7f889cfd1cd8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f889d8d49d5]
 6: (()+0x5e946) [0x7f889d8d2946]
 7: (()+0x5e973) [0x7f889d8d2973]
 8: (()+0x5eb9f) [0x7f889d8d2b9f]
 9: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0x5b8) [0x523fa8]
 10: (main()+0x1e0e) [0x4ead4e]
 11: (__libc_start_main()+0xf5) [0x7f889cfbcaf5]
 12: crushtool() [0x4ee5a9]
 NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
  max_recent   500
  max_new 1000
  log_file
---

Re: [ceph-users] CephFS questions

2015-03-23 Thread Bogdan SOLGA
Hello, John!

Thank you very much for your reply and for the provided information! As a
follow-up to your email, a few other questions have arisen:

   - is the http://ceph.com/docs/master/cephfs/ page referring to the
   current release version (Giant) or to the HEAD (Hammer) version? if it's
   referring to Giant -- are there any major improvements and fixes for CephFS
   included in the (upcoming) Hammer release?


   - the 'one filesystem per Ceph cluster' sounds like a (possible)
   drawback, from the flexibility point of view. Is this something which will
   be (or is currently) worked on?


   - regarding the system users created on a CephFS -- if it's still not
   production ready (according to the first replied bullet), I guess I'll try
   the Ceph block device functionality, as it seems more appropriate to my
   needs. Of course, I will post any bugs to the bug tracker.

Thanks, again!
Kind regards,
Bogdan


On Mon, Mar 23, 2015 at 12:47 PM, John Spray  wrote:

> On 22/03/2015 08:29, Bogdan SOLGA wrote:
>
>> Hello, everyone!
>>
>> I have a few questions related to the CephFS part of Ceph:
>>
>>   * is it production ready?
>>
>>  Like it says at http://ceph.com/docs/master/cephfs/: " CephFS currently
> lacks a robust ‘fsck’ check and repair function. Please use caution when
> storing important data as the disaster recovery tools are still under
> development".  That page was recently updated.
>
>>
>>   * can multiple CephFS be created on the same cluster? The CephFS
>> creation  page
>> describes how to create a CephFS using (at least) two pools, but
>> the mounting 
>> page does not refer to any pool, when mounting the FS;
>>
>>  Currently you can only have one filesystem per Ceph cluster.
>
>>
>>   * besides the pool quota
>> > #set-pool-quotas>
>> setting, are there any means by which a CephFS can have a quota
>> defined? I have found this
>> > Cephfs_quota_support>
>> document, which is from the Firefly release (and it seems only a
>> draft), but no other references on the matter.
>>
>>  Yes, when using the fuse client there is a per-directory quota system
> available, although it is not guaranteed to be completely strict. I don't
> think there is any documentation for that, but you can see how to use it
> here:
> https://github.com/ceph/ceph/blob/master/qa/workunits/fs/quota/quota.sh
>
>>
>>   * this  page
>> refers to 'mounting only a part of the namespace' -- what is the
>> namespace referred in the page?
>>
>>  In this context namespace means the filesystem tree.  So "part of the
> namespace" means a subdirectory.
>
>>
>>   * can a CephFS be mounted simultaneously from multiple clients?
>>
>>  Yes.
>
>>
>>   * what would be the recommended way of creating system users on a
>> CephFS, if a quota is needed for each user? create a pool for each
>> user? or?
>>
>>  No recommendation at this stage - it would be interesting for you to try
> some things and let us know how you get on.
>
> Cheers,
> John
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CRUSH decompile fails

2015-03-23 Thread Robert LeBlanc
Ok, so the decompile error is because I didn't download the CRUSH map
(found that out using hexdump), but I still can't compile an
unmodified CRUSH map.

[root@nodezz ~]# crushtool -d map.crush -o map
[root@nodezz ~]# crushtool -c map -o map.crush
map:105 error: parse error at ''

For some reason it doesn't like the rack definition. I can move things
around, like putting root before it and it always chokes on the first
rack definition no matter which one it is.

On Mon, Mar 23, 2015 at 12:53 PM, Robert LeBlanc  wrote:
> I was trying to decompile and edit the CRUSH map to adjust the CRUSH
> rules. My first attempt created a map that would decompile, but I
> could not recompile the CRUSH even if didn't modify it. When trying to
> download the CRUSH fresh, now the decompile fails.
>
> [root@nodezz ~]# ceph osd getmap -o map.crush
> got osdmap epoch 12792
> [root@nodezz ~]# crushtool -d map.crush -o map
> terminate called after throwing an instance of 'ceph::buffer::malformed_input'
>   what():  buffer::malformed_input: bad magic number
> *** Caught signal (Aborted) **
>  in thread 7f889ed24780
>  ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4)
>  1: crushtool() [0x4f4542]
>  2: (()+0xf130) [0x7f889df97130]
>  3: (gsignal()+0x39) [0x7f889cfd05c9]
>  4: (abort()+0x148) [0x7f889cfd1cd8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f889d8d49d5]
>  6: (()+0x5e946) [0x7f889d8d2946]
>  7: (()+0x5e973) [0x7f889d8d2973]
>  8: (()+0x5eb9f) [0x7f889d8d2b9f]
>  9: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0x5b8) [0x523fa8]
>  10: (main()+0x1e0e) [0x4ead4e]
>  11: (__libc_start_main()+0xf5) [0x7f889cfbcaf5]
>  12: crushtool() [0x4ee5a9]
> 2015-03-23 12:46:34.637635 7f889ed24780 -1 *** Caught signal (Aborted) **
>  in thread 7f889ed24780
>
>  ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4)
>  1: crushtool() [0x4f4542]
>  2: (()+0xf130) [0x7f889df97130]
>  3: (gsignal()+0x39) [0x7f889cfd05c9]
>  4: (abort()+0x148) [0x7f889cfd1cd8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f889d8d49d5]
>  6: (()+0x5e946) [0x7f889d8d2946]
>  7: (()+0x5e973) [0x7f889d8d2973]
>  8: (()+0x5eb9f) [0x7f889d8d2b9f]
>  9: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0x5b8) [0x523fa8]
>  10: (main()+0x1e0e) [0x4ead4e]
>  11: (__libc_start_main()+0xf5) [0x7f889cfbcaf5]
>  12: crushtool() [0x4ee5a9]
>  NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this.
>
> --- begin dump of recent events ---
>-14> 2015-03-23 12:46:34.633547 7f889ed24780  5 asok(0x3229cc0)
> register_command perfcounters_dump hook 0x322be00
>-13> 2015-03-23 12:46:34.633580 7f889ed24780  5 asok(0x3229cc0)
> register_command 1 hook 0x322be00
>-12> 2015-03-23 12:46:34.633587 7f889ed24780  5 asok(0x3229cc0)
> register_command perf dump hook 0x322be00
>-11> 2015-03-23 12:46:34.633596 7f889ed24780  5 asok(0x3229cc0)
> register_command perfcounters_schema hook 0x322be00
>-10> 2015-03-23 12:46:34.633604 7f889ed24780  5 asok(0x3229cc0)
> register_command 2 hook 0x322be00
> -9> 2015-03-23 12:46:34.633609 7f889ed24780  5 asok(0x3229cc0)
> register_command perf schema hook 0x322be00
> -8> 2015-03-23 12:46:34.633615 7f889ed24780  5 asok(0x3229cc0)
> register_command perf reset hook 0x322be00
> -7> 2015-03-23 12:46:34.633639 7f889ed24780  5 asok(0x3229cc0)
> register_command config show hook 0x322be00
> -6> 2015-03-23 12:46:34.633654 7f889ed24780  5 asok(0x3229cc0)
> register_command config set hook 0x322be00
> -5> 2015-03-23 12:46:34.633661 7f889ed24780  5 asok(0x3229cc0)
> register_command config get hook 0x322be00
> -4> 2015-03-23 12:46:34.633672 7f889ed24780  5 asok(0x3229cc0)
> register_command config diff hook 0x322be00
> -3> 2015-03-23 12:46:34.633685 7f889ed24780  5 asok(0x3229cc0)
> register_command log flush hook 0x322be00
> -2> 2015-03-23 12:46:34.633698 7f889ed24780  5 asok(0x3229cc0)
> register_command log dump hook 0x322be00
> -1> 2015-03-23 12:46:34.633711 7f889ed24780  5 asok(0x3229cc0)
> register_command log reopen hook 0x322be00
>  0> 2015-03-23 12:46:34.637635 7f889ed24780 -1 *** Caught signal
> (Aborted) **
>  in thread 7f889ed24780
>
>  ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4)
>  1: crushtool() [0x4f4542]
>  2: (()+0xf130) [0x7f889df97130]
>  3: (gsignal()+0x39) [0x7f889cfd05c9]
>  4: (abort()+0x148) [0x7f889cfd1cd8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f889d8d49d5]
>  6: (()+0x5e946) [0x7f889d8d2946]
>  7: (()+0x5e973) [0x7f889d8d2973]
>  8: (()+0x5eb9f) [0x7f889d8d2b9f]
>  9: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0x5b8) [0x523fa8]
>  10: (main()+0x1e0e) [0x4ead4e]
>  11: (__libc_start_main()+0xf5) [0x7f889cfbcaf5]
>  12: crushtool() [0x4ee5a9]
>  NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this.
>
> --- logging levels ---
>0/ 5 none
>0/ 1 lockdep
>0/ 1 context
>1/ 1 crush
>1/ 5 mds
>1/ 5 mds_balancer
> 

Re: [ceph-users] CRUSH decompile fails

2015-03-23 Thread Sage Weil
On Mon, 23 Mar 2015, Robert LeBlanc wrote:
> Ok, so the decompile error is because I didn't download the CRUSH map
> (found that out using hexdump), but I still can't compile an
> unmodified CRUSH map.
> 
> [root@nodezz ~]# crushtool -d map.crush -o map
> [root@nodezz ~]# crushtool -c map -o map.crush
> map:105 error: parse error at ''
> 
> For some reason it doesn't like the rack definition. I can move things
> around, like putting root before it and it always chokes on the first
> rack definition no matter which one it is.

This was fixed after v0.93... it works with current master and will 
work with hammer v0.94.

Thanks!
sage


> 
> On Mon, Mar 23, 2015 at 12:53 PM, Robert LeBlanc  wrote:
> > I was trying to decompile and edit the CRUSH map to adjust the CRUSH
> > rules. My first attempt created a map that would decompile, but I
> > could not recompile the CRUSH even if didn't modify it. When trying to
> > download the CRUSH fresh, now the decompile fails.
> >
> > [root@nodezz ~]# ceph osd getmap -o map.crush
> > got osdmap epoch 12792
> > [root@nodezz ~]# crushtool -d map.crush -o map
> > terminate called after throwing an instance of 
> > 'ceph::buffer::malformed_input'
> >   what():  buffer::malformed_input: bad magic number
> > *** Caught signal (Aborted) **
> >  in thread 7f889ed24780
> >  ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4)
> >  1: crushtool() [0x4f4542]
> >  2: (()+0xf130) [0x7f889df97130]
> >  3: (gsignal()+0x39) [0x7f889cfd05c9]
> >  4: (abort()+0x148) [0x7f889cfd1cd8]
> >  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f889d8d49d5]
> >  6: (()+0x5e946) [0x7f889d8d2946]
> >  7: (()+0x5e973) [0x7f889d8d2973]
> >  8: (()+0x5eb9f) [0x7f889d8d2b9f]
> >  9: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0x5b8) [0x523fa8]
> >  10: (main()+0x1e0e) [0x4ead4e]
> >  11: (__libc_start_main()+0xf5) [0x7f889cfbcaf5]
> >  12: crushtool() [0x4ee5a9]
> > 2015-03-23 12:46:34.637635 7f889ed24780 -1 *** Caught signal (Aborted) **
> >  in thread 7f889ed24780
> >
> >  ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4)
> >  1: crushtool() [0x4f4542]
> >  2: (()+0xf130) [0x7f889df97130]
> >  3: (gsignal()+0x39) [0x7f889cfd05c9]
> >  4: (abort()+0x148) [0x7f889cfd1cd8]
> >  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f889d8d49d5]
> >  6: (()+0x5e946) [0x7f889d8d2946]
> >  7: (()+0x5e973) [0x7f889d8d2973]
> >  8: (()+0x5eb9f) [0x7f889d8d2b9f]
> >  9: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0x5b8) [0x523fa8]
> >  10: (main()+0x1e0e) [0x4ead4e]
> >  11: (__libc_start_main()+0xf5) [0x7f889cfbcaf5]
> >  12: crushtool() [0x4ee5a9]
> >  NOTE: a copy of the executable, or `objdump -rdS ` is
> > needed to interpret this.
> >
> > --- begin dump of recent events ---
> >-14> 2015-03-23 12:46:34.633547 7f889ed24780  5 asok(0x3229cc0)
> > register_command perfcounters_dump hook 0x322be00
> >-13> 2015-03-23 12:46:34.633580 7f889ed24780  5 asok(0x3229cc0)
> > register_command 1 hook 0x322be00
> >-12> 2015-03-23 12:46:34.633587 7f889ed24780  5 asok(0x3229cc0)
> > register_command perf dump hook 0x322be00
> >-11> 2015-03-23 12:46:34.633596 7f889ed24780  5 asok(0x3229cc0)
> > register_command perfcounters_schema hook 0x322be00
> >-10> 2015-03-23 12:46:34.633604 7f889ed24780  5 asok(0x3229cc0)
> > register_command 2 hook 0x322be00
> > -9> 2015-03-23 12:46:34.633609 7f889ed24780  5 asok(0x3229cc0)
> > register_command perf schema hook 0x322be00
> > -8> 2015-03-23 12:46:34.633615 7f889ed24780  5 asok(0x3229cc0)
> > register_command perf reset hook 0x322be00
> > -7> 2015-03-23 12:46:34.633639 7f889ed24780  5 asok(0x3229cc0)
> > register_command config show hook 0x322be00
> > -6> 2015-03-23 12:46:34.633654 7f889ed24780  5 asok(0x3229cc0)
> > register_command config set hook 0x322be00
> > -5> 2015-03-23 12:46:34.633661 7f889ed24780  5 asok(0x3229cc0)
> > register_command config get hook 0x322be00
> > -4> 2015-03-23 12:46:34.633672 7f889ed24780  5 asok(0x3229cc0)
> > register_command config diff hook 0x322be00
> > -3> 2015-03-23 12:46:34.633685 7f889ed24780  5 asok(0x3229cc0)
> > register_command log flush hook 0x322be00
> > -2> 2015-03-23 12:46:34.633698 7f889ed24780  5 asok(0x3229cc0)
> > register_command log dump hook 0x322be00
> > -1> 2015-03-23 12:46:34.633711 7f889ed24780  5 asok(0x3229cc0)
> > register_command log reopen hook 0x322be00
> >  0> 2015-03-23 12:46:34.637635 7f889ed24780 -1 *** Caught signal
> > (Aborted) **
> >  in thread 7f889ed24780
> >
> >  ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4)
> >  1: crushtool() [0x4f4542]
> >  2: (()+0xf130) [0x7f889df97130]
> >  3: (gsignal()+0x39) [0x7f889cfd05c9]
> >  4: (abort()+0x148) [0x7f889cfd1cd8]
> >  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f889d8d49d5]
> >  6: (()+0x5e946) [0x7f889d8d2946]
> >  7: (()+0x5e973) [0x7f889d8d2973]
> >  8: (()+0x5eb9f) [0x7f889d8d2b9f]
> >  9: (CrushWrapper::decode(ceph::buffer::list::iter

Re: [ceph-users] CRUSH decompile fails

2015-03-23 Thread Robert LeBlanc
OK, sorry for all the quick e-mails, but I got it to compile. For some
reason there are a few errors from decompiling the CRUSH map.

1. The decompiled map has "alg straw2" which is not valid; removing
the 2 lets it compile.
2. The hosts have weight 0.000, which I don't think prevents the map
from compiling, but will cause other issues.

I created the rack entries on the command line and moved the host
buckets to the racks, then exported the CRUSH to modify the rules.

ceph osd crush add-bucket racka rack
ceph osd crush add-bucket rackb rack
ceph osd crush move nodev rack=racka
ceph osd crush move nodew rack=racka
ceph osd crush move nodex rack=rackb
ceph osd crush move nodey rack=rackb
ceph osd crush move racka root=default
ceph osd crush move rackb root=default
ceph osd crush add-bucket ssd-racka rack
ceph osd crush add-bucket ssd-rackb rack
ceph osd crush move ssd-racka root=ssd
ceph osd crush move ssd-rackb root=ssd
ceph osd crush move nodev-ssd rack=ssd-racka
ceph osd crush move nodew-ssd rack=ssd-racka
ceph osd crush move nodex-ssd rack=ssd-rackb
ceph osd crush move nodey-ssd rack=ssd-rackb

Just saw the e-mail from Sage saying that is all fixed after .93
(which we are on). Saving for posterity's sake. Thanks Sage!

On Mon, Mar 23, 2015 at 1:09 PM, Robert LeBlanc  wrote:
> Ok, so the decompile error is because I didn't download the CRUSH map
> (found that out using hexdump), but I still can't compile an
> unmodified CRUSH map.
>
> [root@nodezz ~]# crushtool -d map.crush -o map
> [root@nodezz ~]# crushtool -c map -o map.crush
> map:105 error: parse error at ''
>
> For some reason it doesn't like the rack definition. I can move things
> around, like putting root before it and it always chokes on the first
> rack definition no matter which one it is.
>
> On Mon, Mar 23, 2015 at 12:53 PM, Robert LeBlanc  wrote:
>> I was trying to decompile and edit the CRUSH map to adjust the CRUSH
>> rules. My first attempt created a map that would decompile, but I
>> could not recompile the CRUSH even if didn't modify it. When trying to
>> download the CRUSH fresh, now the decompile fails.
>>
>> [root@nodezz ~]# ceph osd getmap -o map.crush
>> got osdmap epoch 12792
>> [root@nodezz ~]# crushtool -d map.crush -o map
>> terminate called after throwing an instance of 
>> 'ceph::buffer::malformed_input'
>>   what():  buffer::malformed_input: bad magic number
>> *** Caught signal (Aborted) **
>>  in thread 7f889ed24780
>>  ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4)
>>  1: crushtool() [0x4f4542]
>>  2: (()+0xf130) [0x7f889df97130]
>>  3: (gsignal()+0x39) [0x7f889cfd05c9]
>>  4: (abort()+0x148) [0x7f889cfd1cd8]
>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f889d8d49d5]
>>  6: (()+0x5e946) [0x7f889d8d2946]
>>  7: (()+0x5e973) [0x7f889d8d2973]
>>  8: (()+0x5eb9f) [0x7f889d8d2b9f]
>>  9: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0x5b8) [0x523fa8]
>>  10: (main()+0x1e0e) [0x4ead4e]
>>  11: (__libc_start_main()+0xf5) [0x7f889cfbcaf5]
>>  12: crushtool() [0x4ee5a9]
>> 2015-03-23 12:46:34.637635 7f889ed24780 -1 *** Caught signal (Aborted) **
>>  in thread 7f889ed24780
>>
>>  ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4)
>>  1: crushtool() [0x4f4542]
>>  2: (()+0xf130) [0x7f889df97130]
>>  3: (gsignal()+0x39) [0x7f889cfd05c9]
>>  4: (abort()+0x148) [0x7f889cfd1cd8]
>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f889d8d49d5]
>>  6: (()+0x5e946) [0x7f889d8d2946]
>>  7: (()+0x5e973) [0x7f889d8d2973]
>>  8: (()+0x5eb9f) [0x7f889d8d2b9f]
>>  9: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0x5b8) [0x523fa8]
>>  10: (main()+0x1e0e) [0x4ead4e]
>>  11: (__libc_start_main()+0xf5) [0x7f889cfbcaf5]
>>  12: crushtool() [0x4ee5a9]
>>  NOTE: a copy of the executable, or `objdump -rdS ` is
>> needed to interpret this.
>>
>> --- begin dump of recent events ---
>>-14> 2015-03-23 12:46:34.633547 7f889ed24780  5 asok(0x3229cc0)
>> register_command perfcounters_dump hook 0x322be00
>>-13> 2015-03-23 12:46:34.633580 7f889ed24780  5 asok(0x3229cc0)
>> register_command 1 hook 0x322be00
>>-12> 2015-03-23 12:46:34.633587 7f889ed24780  5 asok(0x3229cc0)
>> register_command perf dump hook 0x322be00
>>-11> 2015-03-23 12:46:34.633596 7f889ed24780  5 asok(0x3229cc0)
>> register_command perfcounters_schema hook 0x322be00
>>-10> 2015-03-23 12:46:34.633604 7f889ed24780  5 asok(0x3229cc0)
>> register_command 2 hook 0x322be00
>> -9> 2015-03-23 12:46:34.633609 7f889ed24780  5 asok(0x3229cc0)
>> register_command perf schema hook 0x322be00
>> -8> 2015-03-23 12:46:34.633615 7f889ed24780  5 asok(0x3229cc0)
>> register_command perf reset hook 0x322be00
>> -7> 2015-03-23 12:46:34.633639 7f889ed24780  5 asok(0x3229cc0)
>> register_command config show hook 0x322be00
>> -6> 2015-03-23 12:46:34.633654 7f889ed24780  5 asok(0x3229cc0)
>> register_command config set hook 0x322be00
>> -5> 2015-03-23 12:46:34.633661 7f889ed24780  5 asok(0x3229cc0)
>> register_command co

Re: [ceph-users] OSD + Flashcache + udev + Partition uuid

2015-03-23 Thread Brendan Moloney
I have been looking at the options for SSD caching for a bit now. Here is my 
take on the current options:

1) bcache - Seems to have lots of reliability issues mentioned on the mailing 
list, with little sign of improvement.

2) flashcache - Seems to be no longer (or only minimally?) developed/maintained; 
instead folks are working on the fork enhanceio.

3) enhanceio - Fork of flashcache.  Dropped the ability to skip caching on 
sequential writes, which many folks have claimed is important for Ceph OSD 
caching performance. (see: https://github.com/stec-inc/EnhanceIO/issues/32)

4) LVM cache (dm-cache) - There is now a user-friendly way to use dm-cache, 
through LVM.  Allows sequential writes to be skipped. You need a pretty recent 
kernel.

I am going to be trying out LVM cache on my own cluster in the next few weeks.  
I will share my results here on the mailing list.  If anyone else has tried it 
out I would love to hear about it.
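
The rough shape of what I plan to try is something like the following (VG/LV 
names, sizes and devices are placeholders; see lvmcache(7) for the real details):

    # origin LV for the OSD data on the spinning disk
    lvcreate -n osd0-data -l 100%PVS vg-osd0 /dev/sdb
    # cache pool on a spare SSD partition
    lvcreate --type cache-pool -L 50G -n osd0-cache vg-osd0 /dev/sdc1
    # attach the cache pool to the origin LV
    lvconvert --type cache --cachepool vg-osd0/osd0-cache vg-osd0/osd0-data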

-Brendan

> In a long term use I also had some issues with flashcache and enhanceio. I've 
> noticed frequent slow requests.
> 
> Andrei
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD + Flashcache + udev + Partition uuid

2015-03-23 Thread Noah Mehl
We deployed with just putting the journal on an SSD directly; why would this 
not work for you?  Just wondering, really :)

Thanks!

~Noah

> On Mar 23, 2015, at 4:36 PM, Brendan Moloney  wrote:
> 
> I have been looking at the options for SSD caching for a bit now. Here is my 
> take on the current options:
> 
> 1) bcache - Seems to have lots of reliability issues mentioned on mailing 
> list with little sign of improvement.
> 
> 2) flashcache - Seems to be no longer (or minimally?) developed/maintained, 
> instead folks are working on the fork enhanceio.
> 
> 3) enhanceio - Fork of flashcache.  Dropped the ability to skip caching on 
> sequential writes, which many folks have claimed is important for Ceph OSD 
> caching performance. (see: https://github.com/stec-inc/EnhanceIO/issues/32)
> 
> 4) LVM cache (dm-cache) - There is now a user friendly way to use dm-cache, 
> through LVM.  Allows sequential writes to be skipped. You need a pretty 
> recent kernel.
> 
> I am going to be trying out LVM cache on my own cluster in the next few 
> weeks.  I will share my results here on the mailing list.  If anyone else has 
> tried it out I would love to hear about it.
> 
> -Brendan
> 
>> In a long term use I also had some issues with flashcache and enhanceio. 
>> I've noticed frequent slow requests.
>> 
>> Andrei
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD + Flashcache + udev + Partition uuid

2015-03-23 Thread Brendan Moloney
This would be in addition to having the journal on SSD.  The journal doesn't 
help at all with small random reads and has a fairly limited ability to 
coalesce writes. 

In my case, the SSDs we are using for journals should have plenty of 
bandwidth/IOPs/space to spare, so I want to see if I can get a little more out 
of them.

-Brendan


From: Noah Mehl [noahm...@combinedpublic.com]
Sent: Monday, March 23, 2015 1:45 PM
To: Brendan Moloney
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] OSD + Flashcache + udev + Partition uuid

We deployed with just putting the journal on an SSD directly, why would this 
not work for you?  Just wondering really :)

Thanks!

~Noah

> On Mar 23, 2015, at 4:36 PM, Brendan Moloney  wrote:
>
> I have been looking at the options for SSD caching for a bit now. Here is my 
> take on the current options:
>
> 1) bcache - Seems to have lots of reliability issues mentioned on mailing 
> list with little sign of improvement.
>
> 2) flashcache - Seems to be no longer (or minimally?) developed/maintained, 
> instead folks are working on the fork enhanceio.
>
> 3) enhanceio - Fork of flashcache.  Dropped the ability to skip caching on 
> sequential writes, which many folks have claimed is important for Ceph OSD 
> caching performance. (see: https://github.com/stec-inc/EnhanceIO/issues/32)
>
> 4) LVM cache (dm-cache) - There is now a user friendly way to use dm-cache, 
> through LVM.  Allows sequential writes to be skipped. You need a pretty 
> recent kernel.
>
> I am going to be trying out LVM cache on my own cluster in the next few 
> weeks.  I will share my results here on the mailing list.  If anyone else has 
> tried it out I would love to hear about it.
>
> -Brendan
>
>> In a long term use I also had some issues with flashcache and enhanceio. 
>> I've noticed frequent slow requests.
>>
>> Andrei
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ERROR: missing keyring, cannot use cephx for authentication

2015-03-23 Thread Jesus Chavez (jeschave)
Hi all, I did an HA failover test, shutting down 1 node, and I see that only 1 OSD 
came up after the reboot:

[root@geminis ceph]# df -h
Filesystem Size  Used Avail Use% Mounted on
/dev/mapper/rhel-root   50G  4.5G   46G   9% /
devtmpfs   126G 0  126G   0% /dev
tmpfs  126G   80K  126G   1% /dev/shm
tmpfs  126G  9.9M  126G   1% /run
tmpfs  126G 0  126G   0% /sys/fs/cgroup
/dev/sda1  494M  165M  330M  34% /boot
/dev/mapper/rhel-home   36G   44M   36G   1% /home
/dev/sdc1  3.7T  134M  3.7T   1% /var/lib/ceph/osd/ceph-14

If I run 'service ceph restart' I get this error message…

Stopping Ceph osd.94 on geminis...done
=== osd.94 ===
2015-03-23 15:05:41.632505 7fe7b9941700 -1 monclient(hunting): ERROR: missing 
keyring, cannot use cephx for authentication
2015-03-23 15:05:41.632508 7fe7b9941700  0 librados: osd.94 initialization 
error (2) No such file or directory
Error connecting to cluster: ObjectNotFound
failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.94 
--keyring=/var/lib/ceph/osd/ceph-94/keyring osd crush create-or-move -- 94 0.05 
host=geminis root=default


I have ceph.conf and ceph.client.admin.keyring under /etc/ceph:


[root@geminis ceph]# ls /etc/ceph
ceph.client.admin.keyring  ceph.conf  rbdmap  tmp1OqNFi  tmptQ0a1P
[root@geminis ceph]#


does anybody know what could be wrong?

Thanks





Jesus Chavez
SYSTEMS ENGINEER-C.SALES

jesch...@cisco.com
Phone: +52 55 5267 3146
Mobile: +51 1 5538883255

CCIE - 44433


Cisco.com








  Think before you print.

This email may contain confidential and privileged material for the sole use of 
the intended recipient. Any review, use, distribution or disclosure by others 
is strictly prohibited. If you are not the intended recipient (or authorized to 
receive for the recipient), please contact the sender by reply email and delete 
all copies of this message.

Please click here for Company Registration Information.





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD + Flashcache + udev + Partition uuid

2015-03-23 Thread Nick Fisk




> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Brendan Moloney
> Sent: 23 March 2015 21:02
> To: Noah Mehl
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] OSD + Flashcache + udev + Partition uuid
> 
> This would be in addition to having the journal on SSD.  The journal
doesn't
> help at all with small random reads and has a fairly limited ability to
coalesce
> writes.
> 
> In my case, the SSDs we are using for journals should have plenty of
> bandwidth/IOPs/space to spare, so I want to see if I can get a little more
out
> of them.
> 
> -Brendan
> 
> 
> From: Noah Mehl [noahm...@combinedpublic.com]
> Sent: Monday, March 23, 2015 1:45 PM
> To: Brendan Moloney
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] OSD + Flashcache + udev + Partition uuid
> 
> We deployed with just putting the journal on an SSD directly, why would
this
> not work for you?  Just wondering really :)
> 
> Thanks!
> 
> ~Noah
> 
> > On Mar 23, 2015, at 4:36 PM, Brendan Moloney 
> wrote:
> >
> > I have been looking at the options for SSD caching for a bit now. Here
is my
> take on the current options:
> >
> > 1) bcache - Seems to have lots of reliability issues mentioned on
mailing list
> with little sign of improvement.
> >
> > 2) flashcache - Seems to be no longer (or minimally?)
> developed/maintained, instead folks are working on the fork enhanceio.
> >
> > 3) enhanceio - Fork of flashcache.  Dropped the ability to skip caching
on
> sequential writes, which many folks have claimed is important for Ceph OSD
> caching performance. (see: https://github.com/stec-
> inc/EnhanceIO/issues/32)
> >
> > 4) LVM cache (dm-cache) - There is now a user friendly way to use dm-
> cache, through LVM.  Allows sequential writes to be skipped. You need a
> pretty recent kernel.
> >
> > I am going to be trying out LVM cache on my own cluster in the next few
> weeks.  I will share my results here on the mailing list.  If anyone else
has
> tried it out I would love to hear about it.
> >
> > -Brendan
> >
> >> In a long term use I also had some issues with flashcache and
enhanceio.
> I've noticed frequent slow requests.
> >>
> >> Andrei
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD + Flashcache + udev + Partition uuid

2015-03-23 Thread Nick Fisk
Just to add, the main reason it seems to make a difference is the metadata
updates which lie on the actual OSD. When you are doing small block writes,
these metadata updates seem to take almost as long as the actual data, so
although the writes are getting coalesced, the actual performance isn't much
better. 

I did a blktrace a week ago, writing 500MB in 64k blocks to an OSD. You
could see that the actual data was flushed to the OSD in a couple of
seconds; another 30 seconds was spent writing out metadata and doing
EXT4/XFS journal writes.

Normally I have found flashcache to perform really poorly, as it does
everything in 4kb blocks, meaning that when you start throwing larger blocks
at it, it can actually slow things down. However, for the purpose of OSDs
you can set the IO cutoff size limit to around 16-32kb and then it should
only cache the metadata updates.

I'm hoping to do some benchmarks before and after flashcache on a SSD
Journaled OSD this week, so will post results when I have them.
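
For the before/after numbers I will probably use something simple along these
lines (the pool name, sizes and OSD id are only examples):

    # write 64k objects against a test pool for 60 seconds, 16 in flight
    rados bench -p testpool 60 write -b 65536 -t 16
    # or exercise a single OSD directly: 1GB total in 64k writes
    ceph tell osd.0 bench 1073741824 65536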

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Brendan Moloney
> Sent: 23 March 2015 21:02
> To: Noah Mehl
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] OSD + Flashcache + udev + Partition uuid
> 
> This would be in addition to having the journal on SSD.  The journal
doesn't
> help at all with small random reads and has a fairly limited ability to
coalesce
> writes.
> 
> In my case, the SSDs we are using for journals should have plenty of
> bandwidth/IOPs/space to spare, so I want to see if I can get a little more
out
> of them.
> 
> -Brendan
> 
> 
> From: Noah Mehl [noahm...@combinedpublic.com]
> Sent: Monday, March 23, 2015 1:45 PM
> To: Brendan Moloney
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] OSD + Flashcache + udev + Partition uuid
> 
> We deployed with just putting the journal on an SSD directly, why would
this
> not work for you?  Just wondering really :)
> 
> Thanks!
> 
> ~Noah
> 
> > On Mar 23, 2015, at 4:36 PM, Brendan Moloney 
> wrote:
> >
> > I have been looking at the options for SSD caching for a bit now. Here
is my
> take on the current options:
> >
> > 1) bcache - Seems to have lots of reliability issues mentioned on
mailing list
> with little sign of improvement.
> >
> > 2) flashcache - Seems to be no longer (or minimally?)
> developed/maintained, instead folks are working on the fork enhanceio.
> >
> > 3) enhanceio - Fork of flashcache.  Dropped the ability to skip caching
on
> sequential writes, which many folks have claimed is important for Ceph OSD
> caching performance. (see: https://github.com/stec-
> inc/EnhanceIO/issues/32)
> >
> > 4) LVM cache (dm-cache) - There is now a user friendly way to use dm-
> cache, through LVM.  Allows sequential writes to be skipped. You need a
> pretty recent kernel.
> >
> > I am going to be trying out LVM cache on my own cluster in the next few
> weeks.  I will share my results here on the mailing list.  If anyone else
has
> tried it out I would love to hear about it.
> >
> > -Brendan
> >
> >> In a long term use I also had some issues with flashcache and
enhanceio.
> I've noticed frequent slow requests.
> >>
> >> Andrei
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CRUSH Map Adjustment for Node Replication

2015-03-23 Thread Georgios Dimitrakakis

Hi all!

I had a CEPH Cluster with 10x OSDs, all of them in one node.

Since the cluster was built from the beginning with just one OSD node, the 
crushmap had replication across OSDs as the default.

Here is the relevant part from my crushmap:


# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type osd
step emit
}

# end crush map


I have added a new node with 10x more identical OSDs, thus the total number of 
OSD nodes is now two.


I have changed the replication factor to be 2 on all pools and I would 
like to make sure that

I always keep each copy on a different node.

In order to do so, do I have to change the CRUSH map?

Which part should I change?


After modifying the CRUSH map what procedure will take place before the 
cluster is ready again?


Is it going to start re-balancing and moving data around? Will a 
deep-scrub follow?


Does the time of the procedure depend on anything else except the 
amount of data and the available connection (bandwidth)?



Looking forward for your answers!


All the best,


George

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS questions

2015-03-23 Thread John Spray

On 23/03/2015 19:00, Bogdan SOLGA wrote:


  * is the http://ceph.com/docs/master/cephfs/ page referring to the
current release version (Giant) or to the HEAD (Hammer) version?
if it's referring to Giant -- are there any major improvements and
fixes for CephFS included in the (upcoming) Hammer release?

The clue is in the URL - that link is built from the latest code in git 
("master").  Substitute hammer/giant to get the docs from a particular 
version.  Yes, there are lots of fixes in hammer.  If you're evaluating 
CephFS you should always use the most recent release you can.


  * the 'one filesystem per Ceph cluster' sounds like a (possible)
drawback, from the flexibility point of view. Is this something
which will be (or is currently) worked on?

There have been steps in that direction (notably the "fs ls" command 
which currently only lists one...).  Many people are also interested in 
multi-tenancy implemented as finer grained access controls within a 
filesystem (i.e. rather than having two filesystems, export two 
directories from one).
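
To give a flavour of the subdirectory approach: each tenant can simply be 
pointed at its own directory at mount time, along these lines (the monitor 
address and paths are only examples):

    # kernel client, mounting only /tenants/alice from the tree
    mount -t ceph 192.168.0.1:6789:/tenants/alice /mnt/alice \
        -o name=admin,secretfile=/etc/ceph/admin.secret
    # fuse client equivalent
    ceph-fuse -r /tenants/bob /mnt/bob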


John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CRUSH Map Adjustment for Node Replication

2015-03-23 Thread Robert LeBlanc
You just need to change your rule from

step chooseleaf firstn 0 type osd

to

step chooseleaf firstn 0 type host

There will be data movement as it will want to move about half the
objects to the new host. There will be data generation as you move
from size 1 to size 2. As far as I know a deep scrub won't happen
until the next scheduled time. The time to do all of this is dependent
on your disk speed, network speed, CPU and RAM capacity as well as the
number of backfill processes configured, the priority of the backfill
process, how active your disks are and how much data you have stored
in the cluster. In short ... it depends.
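
The usual cycle for making that edit by hand is roughly the following (the file
names are arbitrary):

    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # edit the rule: change "type osd" to "type host"
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new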

On Mon, Mar 23, 2015 at 4:30 PM, Georgios Dimitrakakis
 wrote:
> Hi all!
>
> I had a CEPH Cluster with 10x OSDs all of them in one node.
>
> Since the cluster was built from the beginning with just one OSDs node the
> crushmap had as a default
> the replication to be on OSDs.
>
> Here is the relevant part from my crushmap:
>
>
> # rules
> rule replicated_ruleset {
> ruleset 0
> type replicated
> min_size 1
> max_size 10
> step take default
> step chooseleaf firstn 0 type osd
> step emit
> }
>
> # end crush map
>
>
> I have added a new node with 10x more identical OSDs thus the total OSDs
> nodes are now two.
>
> I have changed the replication factor to be 2 on all pools and I would like
> to make sure that
> I always keep each copy on a different node.
>
> In order to do so do I have to change the CRUSH map?
>
> Which part should I change?
>
>
> After modifying the CRUSH map what procedure will take place before the
> cluster is ready again?
>
> Is it going to start re-balancing and moving data around? Will a deep-scrub
> follow?
>
> Does the time of the procedure depends on anything else except the amount of
> data and the available connection (bandwidth)?
>
>
> Looking forward for your answers!
>
>
> All the best,
>
>
> George
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CRUSH Map Adjustment for Node Replication

2015-03-23 Thread Dimitrakakis Georgios
Robert, thanks for the info! 

How can I find out and modify when the next deep scrub is scheduled, as well
as the number of backfill processes and their priority? 

Best regards, 

George 

 Robert LeBlanc wrote: 

>You just need to change your rule from
>
>step chooseleaf firstn 0 type osd
>
>to
>
>step chooseleaf firstn 0 type host
>
>There will be data movement as it will want to move about half the
>objects to the new host. There will be data generation as you move
>from size 1 to size 2. As far as I know a deep scrub won't happen
>until the next scheduled time. The time to do all of this is dependent
>on your disk speed, network speed, CPU and RAM capacity as well as the
>number of backfill processes configured, the priority of the backfill
>process, how active your disks are and how much data you have stored
>in the cluster. In short ... it depends.
>
>On Mon, Mar 23, 2015 at 4:30 PM, Georgios Dimitrakakis
> wrote:
>> Hi all!
>>
>> I had a CEPH Cluster with 10x OSDs all of them in one node.
>>
>> Since the cluster was built from the beginning with just one OSDs node the
>> crushmap had as a default
>> the replication to be on OSDs.
>>
>> Here is the relevant part from my crushmap:
>>
>>
>> # rules
>> rule replicated_ruleset {
>> ruleset 0
>> type replicated
>> min_size 1
>> max_size 10
>> step take default
>> step chooseleaf firstn 0 type osd
>> step emit
>> }
>>
>> # end crush map
>>
>>
>> I have added a new node with 10x more identical OSDs thus the total OSDs
>> nodes are now two.
>>
>> I have changed the replication factor to be 2 on all pools and I would like
>> to make sure that
>> I always keep each copy on a different node.
>>
>> In order to do so do I have to change the CRUSH map?
>>
>> Which part should I change?
>>
>>
>> After modifying the CRUSH map what procedure will take place before the
>> cluster is ready again?
>>
>> Is it going to start re-balancing and moving data around? Will a deep-scrub
>> follow?
>>
>> Does the time of the procedure depends on anything else except the amount of
>> data and the available connection (bandwidth)?
>>
>>
>> Looking forward for your answers!
>>
>>
>> All the best,
>>
>>
>> George
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CRUSH Map Adjustment for Node Replication

2015-03-23 Thread Robert LeBlanc
I don't believe you can set a schedule for the deep scrubs.
People who want that kind of control disable automatic deep scrubs and run a
script that scrubs all PGs. For the other options, you should look
through http://ceph.com/docs/master/rados/configuration/osd-config-ref/
and find what you feel might be most important to you. We mess with
"osd max backfills". You may also want to look at "osd recovery max
active" and "osd recovery op priority", to name a few. You can also adjust the
load threshold below which the cluster will run deep scrubs, etc.
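
As a rough illustration of tweaking those at runtime (the values here are
examples, not recommendations):

# check what an OSD is currently running with
ceph daemon osd.0 config show | grep -E 'backfill|recovery'

# lower the impact of backfill/recovery on all OSDs without restarting them
ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-max-active 1 --osd-recovery-op-priority 1'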

On Mon, Mar 23, 2015 at 5:10 PM, Dimitrakakis Georgios
 wrote:
> Robert thanks for the info!
>
> How can I find out and modify when is scheduled the next deep scrub,
> the number of backfill processes and their priority?
>
> Best regards,
>
> George
>
>
>
>  Robert LeBlanc wrote: 
>
>
> You just need to change your rule from
>
> step chooseleaf firstn 0 type osd
>
> to
>
> step chooseleaf firstn 0 type host
>
> There will be data movement as it will want to move about half the
> objects to the new host. There will be data generation as you move
> from size 1 to size 2. As far as I know a deep scrub won't happen
> until the next scheduled time. The time to do all of this is dependent
> on your disk speed, network speed, CPU and RAM capacity as well as the
> number of backfill processes configured, the priority of the backfill
> process, how active your disks are and how much data you have stored
> in the cluster. In short ... it depends.
>
> On Mon, Mar 23, 2015 at 4:30 PM, Georgios Dimitrakakis
>  wrote:
>> Hi all!
>>
>> I had a CEPH Cluster with 10x OSDs all of them in one node.
>>
>> Since the cluster was built from the beginning with just one OSDs node the
>> crushmap had as a default
>> the replication to be on OSDs.
>>
>> Here is the relevant part from my crushmap:
>>
>>
>> # rules
>> rule replicated_ruleset {
>> ruleset 0
>> type replicated
>> min_size 1
>> max_size 10
>> step take default
>> step chooseleaf firstn 0 type osd
>> step emit
>> }
>>
>> # end crush map
>>
>>
>> I have added a new node with 10x more identical OSDs thus the total OSDs
>> nodes are now two.
>>
>> I have changed the replication factor to be 2 on all pools and I would
>> like
>> to make sure that
>> I always keep each copy on a different node.
>>
>> In order to do so do I have to change the CRUSH map?
>>
>> Which part should I change?
>>
>>
>> After modifying the CRUSH map what procedure will take place before the
>> cluster is ready again?
>>
>> Is it going to start re-balancing and moving data around? Will a
>> deep-scrub
>> follow?
>>
>> Does the time of the procedure depends on anything else except the amount
>> of
>> data and the available connection (bandwidth)?
>>
>>
>> Looking forward for your answers!
>>
>>
>> All the best,
>>
>>
>> George
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD + Flashcache + udev + Partition uuid

2015-03-23 Thread Noah Mehl
Ah, I see now.  Has anyone used CacheCade from LSI for both the read and write
cache to SSD?  I don't know if you can attach a CacheCade device to a JBOD, but
if you could it would probably perform really well….

I ask because I really haven't seen an open-source read and write SSD cache
that performs as well as ZFS, for instance.  And for ZFS, I don't know if
you can add an SSD cache to a single drive.

Thanks!

~Noah

On Mar 23, 2015, at 5:43 PM, Nick Fisk 
mailto:n...@fisk.me.uk>> wrote:

Just to add, the main reason it seems to make a difference is the metadata
updates which lie on the actual OSD. When you are doing small block writes,
these metadata updates seem to take almost as long as the actual data, so
although the writes are getting coalesced, the actual performance isn't much
better.

I did a blktrace a week ago, writing 500MB in 64k blocks to an OSD. You
could see that the actual data was flushed to the OSD in a couple of
seconds, another 30 seconds was spent writing out metadata and doing
EXT4/XFS journal writes.
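
If anyone wants to reproduce that kind of trace, something along these lines
should do it (device name and duration are placeholders):

# capture 60 seconds of block-level activity on the OSD data disk
blktrace -d /dev/sdb -w 60 -o osd-trace
# turn the per-CPU binary traces into something readable
blkparse -i osd-trace | less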

Normally I have found flashcache to perform really poorly as it does
everything in 4kb blocks, meaning that when you start throwing larger blocks
at it, it can actually slow things down. However, for the purpose of OSDs
you can set the IO cutoff size limit to around 16-32kb and then it should
only cache the metadata updates.
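
A heavily hedged sketch of what I mean, assuming flashcache's
skip_seq_thresh_kb sysctl is the knob in question (the exact sysctl path and
the cachedev+datadev naming depend on your flashcache version and device
names):

# skip (do not cache) IO streams larger than 32KB, so mostly the small
# metadata/journal updates end up on the flash device
sysctl -w dev.flashcache.cachedev+datadev.skip_seq_thresh_kb=32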

I'm hoping to do some benchmarks before and after flashcache on a SSD
Journaled OSD this week, so will post results when I have them.

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Brendan Moloney
Sent: 23 March 2015 21:02
To: Noah Mehl
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] OSD + Flashcache + udev + Partition uuid

This would be in addition to having the journal on SSD.  The journal
doesn't
help at all with small random reads and has a fairly limited ability to
coalesce
writes.

In my case, the SSDs we are using for journals should have plenty of
bandwidth/IOPs/space to spare, so I want to see if I can get a little more
out
of them.

-Brendan


From: Noah Mehl 
[noahm...@combinedpublic.com]
Sent: Monday, March 23, 2015 1:45 PM
To: Brendan Moloney
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] OSD + Flashcache + udev + Partition uuid

We deployed with just putting the journal on an SSD directly, why would
this
not work for you?  Just wondering really :)

Thanks!

~Noah

On Mar 23, 2015, at 4:36 PM, Brendan Moloney 
mailto:molo...@ohsu.edu>>
wrote:

I have been looking at the options for SSD caching for a bit now. Here
is my
take on the current options:

1) bcache - Seems to have lots of reliability issues mentioned on
mailing list
with little sign of improvement.

2) flashcache - Seems to be no longer (or minimally?)
developed/maintained, instead folks are working on the fork enhanceio.

3) enhanceio - Fork of flashcache.  Dropped the ability to skip caching
on
sequential writes, which many folks have claimed is important for Ceph OSD
caching performance. (see: https://github.com/stec-
inc/EnhanceIO/issues/32)

4) LVM cache (dm-cache) - There is now a user friendly way to use dm-
cache, through LVM.  Allows sequential writes to be skipped. You need a
pretty recent kernel.

I am going to be trying out LVM cache on my own cluster in the next few
weeks.  I will share my results here on the mailing list.  If anyone else
has
tried it out I would love to hear about it.
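
For the curious, the kind of setup I have in mind looks roughly like this
(device names and sizes are placeholders, and I haven't tested it yet):

pvcreate /dev/sdb /dev/nvme0n1
vgcreate vg-osd0 /dev/sdb /dev/nvme0n1
# the OSD data LV lives on the spinner
lvcreate -n osd0-data -l 100%PVS vg-osd0 /dev/sdb
# cache data + metadata LVs on the SSD
lvcreate -n osd0-cache -L 50G vg-osd0 /dev/nvme0n1
lvcreate -n osd0-cachemeta -L 1G vg-osd0 /dev/nvme0n1
# tie them together and put the cache in front of the data LV
lvconvert --type cache-pool --poolmetadata vg-osd0/osd0-cachemeta vg-osd0/osd0-cache
lvconvert --type cache --cachepool vg-osd0/osd0-cache vg-osd0/osd0-data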

-Brendan

In a long term use I also had some issues with flashcache and
enhanceio.
I've noticed frequent slow requests.

Andrei
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CRUSH Map Adjustment for Node Replication

2015-03-23 Thread Christian Balzer

Georgios,

it really depends on how busy and powerful your cluster is, as Robert
wrote.
If in doubt, lower the backfill value as pointed out by Robert.
Look at osd_scrub_load_threshold and, with new enough versions of Ceph, at
the osd_scrub_sleep setting; the latter is very helpful in keeping deep
scrubs from making the cluster excessively sluggish.
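
A ceph.conf sketch of what I mean (the values are only a starting point, tune
them to your load):

[osd]
# don't start scheduled scrubs while the host's load average is above this
osd scrub load threshold = 0.5
# sleep between scrub chunks so client IO gets a look-in (newer releases only)
osd scrub sleep = 0.1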

Then make the CRUSH change at a time when your cluster is least busy
(weekend nights for many people). 
Wait until the data movement has finished.
After that (maybe the next night) deep scrub all your OSDs, either
sequentially (less impact):

"ceph osd deep-scrub 0" ...

or if your cluster is fast enough all at once:

"ceph osd deep-scrub \*"

My current clusters are fast enough to do this within a few hours, so
basically once you've kicked a deep scrub off at the correct time, it will
happen (with default settings) again a week later, thus (in my case at
least) never having a deep scrub during business hours.

Of course people with really large clusters tend to have enough reserves
that deep scrubs (and rebuilds/backfills due to failed OSDs) during peak
times are not an issue at all (looking at Dan over at CERN ^o^).

Christian

On Mon, 23 Mar 2015 17:24:25 -0600 Robert LeBlanc wrote:

> I don't believe that you can set the schedule of the deep scrubs.
> People that want that kind of control disable deep scrubs and run a
> script to scrub all PGs. For the other options, you should look
> through http://ceph.com/docs/master/rados/configuration/osd-config-ref/
> and find what you feel might be most important to you. We mess with
> "osd max backfills". You may want to look at "osd recovery max
> active", "osd recovery op priority" to name a few. You can adjust the
> idle load of the cluster to perform deep scrubs, etc.
> 
> On Mon, Mar 23, 2015 at 5:10 PM, Dimitrakakis Georgios
>  wrote:
> > Robert thanks for the info!
> >
> > How can I find out and modify when is scheduled the next deep scrub,
> > the number of backfill processes and their priority?
> >
> > Best regards,
> >
> > George
> >
> >
> >
> >  Robert LeBlanc wrote: 
> >
> >
> > You just need to change your rule from
> >
> > step chooseleaf firstn 0 type osd
> >
> > to
> >
> > step chooseleaf firstn 0 type host
> >
> > There will be data movement as it will want to move about half the
> > objects to the new host. There will be data generation as you move
> > from size 1 to size 2. As far as I know a deep scrub won't happen
> > until the next scheduled time. The time to do all of this is dependent
> > on your disk speed, network speed, CPU and RAM capacity as well as the
> > number of backfill processes configured, the priority of the backfill
> > process, how active your disks are and how much data you have stored
> > in the cluster. In short ... it depends.
> >
> > On Mon, Mar 23, 2015 at 4:30 PM, Georgios Dimitrakakis
> >  wrote:
> >> Hi all!
> >>
> >> I had a CEPH Cluster with 10x OSDs all of them in one node.
> >>
> >> Since the cluster was built from the beginning with just one OSDs
> >> node the crushmap had as a default
> >> the replication to be on OSDs.
> >>
> >> Here is the relevant part from my crushmap:
> >>
> >>
> >> # rules
> >> rule replicated_ruleset {
> >> ruleset 0
> >> type replicated
> >> min_size 1
> >> max_size 10
> >> step take default
> >> step chooseleaf firstn 0 type osd
> >> step emit
> >> }
> >>
> >> # end crush map
> >>
> >>
> >> I have added a new node with 10x more identical OSDs thus the total
> >> OSDs nodes are now two.
> >>
> >> I have changed the replication factor to be 2 on all pools and I would
> >> like
> >> to make sure that
> >> I always keep each copy on a different node.
> >>
> >> In order to do so do I have to change the CRUSH map?
> >>
> >> Which part should I change?
> >>
> >>
> >> After modifying the CRUSH map what procedure will take place before
> >> the cluster is ready again?
> >>
> >> Is it going to start re-balancing and moving data around? Will a
> >> deep-scrub
> >> follow?
> >>
> >> Does the time of the procedure depends on anything else except the
> >> amount of
> >> data and the available connection (bandwidth)?
> >>
> >>
> >> Looking forward for your answers!
> >>
> >>
> >> All the best,
> >>
> >>
> >> George
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/

[ceph-users] Does crushtool --test --simulate do what cluster should do?

2015-03-23 Thread Robert LeBlanc
I'm trying to create a CRUSH ruleset and I'm using crushtool to test
the rules, but it doesn't seem to be mapping things correctly. I have two
roots, one for spindles and another for SSD. I have two rules, one for
each root. The output of crushtool on rule 0 shows objects being
mapped to SSD OSDs when it should only be choosing spindles.

I'm pretty sure I'm doing something wrong. I've tested the map on .93 and .80.8.

The map is at http://pastebin.com/BjmuASX0

when running

crushtool -i map.crush --test --num-rep 3 --rule 0 --simulate --show-mappings

I'm getting mappings to OSDs > 39, which are SSDs. The same happens when
I run the SSD rule, I get OSDs from both roots. It is as if crushtool
is not selecting the correct root. In fact both rules result in the
same mapping:

RNG rule 0 x 0 [0,38,23]
RNG rule 0 x 1 [10,25,1]
RNG rule 0 x 2 [11,40,0]
RNG rule 0 x 3 [5,30,26]
RNG rule 0 x 4 [44,30,10]
RNG rule 0 x 5 [8,26,16]
RNG rule 0 x 6 [24,5,36]
RNG rule 0 x 7 [38,10,9]
RNG rule 0 x 8 [39,9,23]
RNG rule 0 x 9 [12,3,24]
RNG rule 0 x 10 [18,6,41]
...

RNG rule 1 x 0 [0,38,23]
RNG rule 1 x 1 [10,25,1]
RNG rule 1 x 2 [11,40,0]
RNG rule 1 x 3 [5,30,26]
RNG rule 1 x 4 [44,30,10]
RNG rule 1 x 5 [8,26,16]
RNG rule 1 x 6 [24,5,36]
RNG rule 1 x 7 [38,10,9]
RNG rule 1 x 8 [39,9,23]
RNG rule 1 x 9 [12,3,24]
RNG rule 1 x 10 [18,6,41]
...
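
For context, a pair of root-specific rules would normally differ only in
their "step take" line, something like the following (the root names here are
placeholders for whatever the map actually calls them):

rule spindle_rule {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take spindles
        step chooseleaf firstn 0 type host
        step emit
}

rule ssd_rule {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take ssd
        step chooseleaf firstn 0 type host
        step emit
}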


Thanks,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Write IO Problem

2015-03-23 Thread Christian Balzer

Hello,

If you had used "performance" or "slow" in your subject, future generations
would be able to find this thread and see what it is about more easily. ^_-

Also, check the various "SSD" + "performance" threads in the ML archives.

On Fri, 20 Mar 2015 14:13:19 + Rottmann Jonas wrote:

> Hi,
> 
> We have a huge write IO Problem in our preproductive Ceph Cluster. First
> our Hardware:
> 
You're not telling us your Ceph version, but from the tunables below I
suppose it is Firefly?
If you have the time, it would definitely be advisable to wait for Hammer
with an all SSD cluster.

> 4 OSD Nodes with:
> 
> Supermicro X10 Board
> 32GB DDR4 RAM
> 2x Intel Xeon E5-2620
> LSI SAS 9300-8i Host Bus Adapter
> Intel Corporation 82599EB 10-Gigabit
> 2x Intel SSDSA2CT040G3 in software raid 1 for system
> 
Nobody really knows what those inane Intel product codes are without
looking them up. 
So you have 2 Intel 320 40GB consumer SSDs that are EOL'ed for the OS.
In a very modern, up to date system otherwise...

When you say "pre-production" cluster up there, does that mean that this
is purely a test bed, or are you planning to turn this into production
eventually?

> Disks:
> 2x Samsung EVO 840 1TB
> 
Unless you're planning to do _very_ little writes, these will wear out in
no time. 
With small IOPS (4KB) you can see up to 12x write amplification with Ceph.
Consider investing in data center level SSDs like the 845 DC PRO or
comparable Intel (S3610, S3700).


> So comulated 8 SSDs as OSD, with btrfs formatted (with ceph-disk, only
> added nodiratime)
> 
Why BTRFS?
As in, what made you feel that this was a good, safe choice?
I guess with SSDs for backing storage you won't at least have to worry
about the massive fragmentation of BTRFS with Ceph...

> Benchmarking one disk alone gives good values:
> 
> dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
> 1073741824 Bytes (1,1 GB) kopiert, 2,53986 s, 423 MB/s
> 
> Fio 8k libaio depth=32:
> write: io=488184KB, bw=52782KB/s, iops=5068 , runt=  9249msec
>
And this is where you start comparing apples to oranges.
That fio was with 8KB blocks and 32 threads.
 
> Here our ceph.conf (pretty much standard):
> 
> [global]
> fsid = 89191a54-740a-46c7-a325-0899ab32fd1d
> mon initial members = cephasp41,ceph-monitor41
> mon host = 172.30.10.15,172.30.10.19
> public network = 172.30.10.0/24
> cluster network = 172.30.10.0/24
> auth cluster required = cephx
> auth service required = cephx
> auth client required = cephx
> 
> #Default is 1GB, which is fine for us
> #osd journal size = {n}
> 
> #Only needed if ext4 comes to play
> #filestore xattr use omap = true
> 
> osd pool default size = 3  # Write an object n times.
> osd pool default min size = 2 # Allow writing n copy in a degraded state.
> 
Normally I'd say a replication of 2 is sufficient with SSDs, but given
your choice of SSDs I'll refrain from that.

> #Set individual per pool by a formula
> #osd pool default pg num = {n}
> #osd pool default pgp num = {n}
> #osd crush chooseleaf type = {n}
> 
> 
> When I benchmark the cluster with "rbd bench-write rbd/fio" I get pretty
> good results: elapsed:18  ops:   262144  ops/sec: 14466.30
> bytes/sec: 59253946.11
> 
Apple and oranges time again, this time you're testing with 4K blocks and
16 threads (defaults for this test).

Incidentally, I get this from a 3 node cluster (replication 3) with 8 OSDs
per node (SATA disk, journals on 4 Intel DC S3700 100GB) and Infiniband
(4QDR) interconnect:
elapsed: 7  ops:   246724  ops/sec: 31157.87  bytes/sec: 135599456.06

> If I for example bench i.e. with fio with rbd engine, I get very poor
> results:
> 
> [global]
> ioengine=rbd
> clientname=admin
> pool=rbd
> rbdname=fio
> invalidate=0# mandatory
> rw=randwrite
> bs=512k
> 
> [rbd_iodepth32]
> iodepth=32
> 
> RESULTS:
> ite: io=2048.0MB, bw=53896KB/s, iops=105, runt= 38911msec
>
Total apples and oranges time: now you're using 512KB blocks (which of
course will reduce IOPS) and 32 threads.
The bandwidth is still about the same as before, and if you multiply
105 x 128 (to compensate for the 4KB blocks) you wind up with 13440, close to
what you've seen with the rbd bench.
Also, from where are you benching?
 
> Also if I mount the rbd with kernel as rbd0, format it with ext4 and
> then do a dd on it, its not that good: "dd if=/dev/zero of=tempfile
> bs=1M count=1024 conv=fdatasync,notrunc" RESULT:
> 1073741824 Bytes (1,1 GB) kopiert, 12,6152 s, 85,1 MB/s
> 
Mounting it where? 
Same system that you did the other tests from?

Did you format it without lazy init, or wait until the lazy init finished
before doing the test?

> I also tried presenting an rbd image with tgtd, mount it onto VMWare
> ESXi and test it in a vm, there I got only round about 50 iops with 4k,
> and writing sequentiell 25Mbytes. With NFS the read sequential values
> are good (400Mbyte/s) but writing only 25Mbyte/s.
>
Can't really comment on that, many things that could cause this and I'm
not an expert in either.
 
> Wha

Re: [ceph-users] Issue with free Inodes

2015-03-23 Thread Kamil Kuramshin

Yes, I read it, but I do not understand what you mean when you say "verify this".
All 3335808 inodes are definitely files and directories created by the ceph
OSD process:


tune2fs 1.42.5 (29-Jul-2012)
Filesystem volume name:   <none>
Last mounted on:  /var/lib/ceph/tmp/mnt.05NAJ3
Filesystem UUID: e4dcca8a-7b68-4f60-9b10-c164dc7f9e33
Filesystem magic number:  0xEF53
Filesystem revision #:1 (dynamic)
Filesystem features:  has_journal ext_attr resize_inode dir_index 
filetype extent flex_bg sparse_super large_file huge_file uninit_bg 
dir_nlink extra_isize

Filesystem flags: signed_directory_hash
Default mount options:user_xattr acl
Filesystem state: clean
Errors behavior:  Continue
Filesystem OS type:   Linux
Inode count:  3335808
Block count:  13342945
Reserved block count: 667147
Free blocks:  5674105
Free inodes:  0
First block:  0
Block size:   4096
Fragment size:4096
Reserved GDT blocks:  1020
Blocks per group: 32768
Fragments per group:  32768
Inodes per group: 8176
Inode blocks per group:   511
Flex block group size:16
Filesystem created:   Fri Feb 20 16:44:25 2015
Last mount time:  Tue Mar 24 09:33:19 2015
Last write time:  Tue Mar 24 09:33:27 2015
Mount count:  7
Maximum mount count:  -1
Last checked: Fri Feb 20 16:44:25 2015
Check interval:   0 (<none>)
Lifetime writes:  4116 GB
Reserved blocks uid:  0 (user root)
Reserved blocks gid:  0 (group root)
First inode:  11
Inode size:   256
Required extra isize: 28
Desired extra isize:  28
Journal inode:8
Default directory hash:   half_md4
Directory Hash Seed: 148ee5dd-7ee0-470c-a08a-b11c318ff90b
Journal backup:   inode blocks

fsck.ext4 /dev/sda1
e2fsck 1.42.5 (29-Jul-2012)
/dev/sda1: clean, 3335808/3335808 files, 7668840/13342945 blocks

On 23.03.2015 17:09, Christian Balzer wrote:

On Mon, 23 Mar 2015 15:26:07 +0300 Kamil Kuramshin wrote:


Yes, I understand that.

The initial purpose of first email was just an advise for new comers. My
fault was in that I was selected ext4 for SSD disks as backend.
But I  did not foresee that inode number can reach its limit before the
free space :)

And maybe there must be some sort of warning not only for free space in
MiBs(GiBs,TiBs) and there must be dedicated warning about free inodes
for filesystems with static inode allocation  like ext4.
Because if OSD reach inode limit it becames totally unusable and
immediately goes down, and from that moment there is no way to start it!


While all that is true and should probably be addressed, please re-read
what I wrote before.

With the 3.3 million inodes used and thus likely as many files (did you
verify this?) and 4MB objects that would make something in the 12TB
ballpark area.

Something very very strange and wrong is going on with your cache tier.

Christian


On 23.03.2015 13:42, Thomas Foster wrote:

You could fix this by changing your block size when formatting the
mount-point with the mkfs -b command.  I had this same issue when
dealing with the filesystem using glusterfs and the solution is to
either use a filesystem that allocates inodes automatically or change
the block size when you build the filesystem.  Unfortunately, the only
way to fix the problem that I have seen is to reformat
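
As a hedged aside: on ext4 the inode density is normally controlled with -i
(bytes per inode) or -N (total inode count) rather than -b, so a reformat
along these lines would roughly double the inode count compared to the
defaults (device name is a placeholder):

# one inode per 8KB of capacity instead of ext4's default of one per 16KB
mkfs.ext4 -i 8192 /dev/sdX1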

On Mon, Mar 23, 2015 at 5:51 AM, Kamil Kuramshin
mailto:kamil.kurams...@tatar.ru>> wrote:

 In my case there was a cache pool for an ec-pool serving RBD images,
 the object size is 4MB, and the client was a kernel-rbd client.
 Each SSD disk is a 60G disk, 2 disks per node, 6 nodes in total = 12
 OSDs in total


 On 23.03.2015 12:00, Christian Balzer wrote:

 Hello,

 This is rather confusing, as cache-tiers are just normal
OSDs/pools and thus should have Ceph objects of around 4MB in size by
default.

 This is matched on what I see with Ext4 here (normal OSD, not a
cache tier):
 ---
 size:
 /dev/sde1   2.7T  204G  2.4T   8% /var/lib/ceph/osd/ceph-0
 inodes:
 /dev/sde1  183148544 55654 183092890
1% /var/lib/ceph/osd/ceph-0 ---

 On a more fragmented cluster I see a 5:1 size to inode ratio.

 I just can't fathom how there could be 3.3 million inodes (and
thus a close number of files) using 30G, making the average file size
below 10 Bytes.

 Something other than your choice of file system is probably at
play here.

 How fragmented are those SSDs?
 What's your default Ceph object size?
 Where _are_ those 3 million files in that OSD, are they actually
in the object files like:
 -rw-r--r-- 1 root root 4194304 Jan  9
15:27 
/var/lib/ceph/osd/ceph-0/current/3.117_head/DIR_7/DIR_1/DIR_5/rb.0.23a8f.238e1f29.00027632__head_C4F3D517__3

 What's your use case, RBD, CephFS, RadosGW?

 Regards,

 Christian

 On Mon, 23 Mar 2015 10:32:

Re: [ceph-users] Write IO Problem

2015-03-23 Thread Alexandre DERUMIER
Hi,

>>dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc 
>>
>>1073741824 Bytes (1,1 GB) kopiert, 2,53986 s, 423 MB/s 

How much do you get with O_DSYNC? (The Ceph journal uses O_DSYNC, and some
SSDs are pretty slow with dsync.)

http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
$ sudo dd if=/dev/urandom of=randfile bs=1M count=1024 && sync
$ sudo dd if=randfile of=/dev/sda bs=4k count=10 oflag=direct,dsync



>>When I benchmark the cluster with “rbd bench-write rbd/fio” I get pretty good 
>>results: 
>>elapsed: 18 ops: 262144 ops/sec: 14466.30 bytes/sec: 59253946.11 

These results seem strange.
14466.30 bytes/sec for 262144 ops/sec? (0.05 bytes per op.)
BTW, I never see big write ops/s with Ceph without a big cluster and big CPUs.



About the dd benchmark: the problem is that dd uses 1 job / iodepth=1 / sequential IO.
So here, network latencies make the difference. (The Ceph team is also working
to optimize that, with the async messenger for example.)
That's why you'll get more iops with fio, with more jobs / a bigger iodepth.
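
A rough sketch of a more parallel fio job for comparison (pool/image names
are taken from your example; the parallelism values are just a starting
point):

[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio
invalidate=0
rw=randwrite
bs=4k

[parallel-randwrite]
iodepth=32
numjobs=4
group_reporting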



If you use a full SSD setup, you should use at least Giant, because of the
sharding feature.
With Firefly, the OSD daemons don't scale well across multiple cores.

Also, from my tests, writes use a lot more CPU than reads. (It can be CPU bound
on 3 nodes with 8-core Xeon E5s at 1.7GHz, replication x3, with a 4k randwrite
workload.)



Also, disabling cephx auth and debug logging helps to get more iops.


If your workload is mainly sequential, enabling rbd_cache will help for writes
by coalescing adjacent block requests,
so fewer ops (but bigger ops), and therefore less CPU.
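
A ceph.conf sketch of those tweaks (illustration only; weigh the security
trade-off of dropping cephx before doing it anywhere shared):

[global]
# disable cephx (test clusters only)
auth cluster required = none
auth service required = none
auth client required = none
# turn down the most expensive debug logging
debug ms = 0
debug osd = 0

[client]
# client-side RBD write cache, merges small sequential writes
rbd cache = true
rbd cache writethrough until flush = true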


Alexandre


- Original Message -
From: "Rottmann Jonas" 
To: "ceph-users" 
Sent: Friday, 20 March 2015 15:13:19
Subject: [ceph-users] Write IO Problem



Hi, 



We have a huge write IO Problem in our preproductive Ceph Cluster. First our 
Hardware: 



4 OSD Nodes with: 



Supermicro X10 Board 

32GB DDR4 RAM 

2x Intel Xeon E5-2620 

LSI SAS 9300-8i Host Bus Adapter 

Intel Corporation 82599EB 10-Gigabit 

2x Intel SSDSA2CT040G3 in software raid 1 for system 



Disks: 

2x Samsung EVO 840 1TB 



So comulated 8 SSDs as OSD, with btrfs formatted (with ceph-disk, only added 
nodiratime) 



Benchmarking one disk alone gives good values: 



dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc 

1073741824 Bytes (1,1 GB) kopiert, 2,53986 s, 423 MB/s 



Fio 8k libaio depth=32: 

write: io=488184KB, bw=52782KB/s, iops=5068 , runt= 9249msec 



Here our ceph.conf (pretty much standard): 



[global] 

fsid = 89191a54-740a-46c7-a325-0899ab32fd1d 

mon initial members = cephasp41,ceph-monitor41 

mon host = 172.30.10.15,172.30.10.19 

public network = 172.30.10.0/24 

cluster network = 172.30.10.0/24 

auth cluster required = cephx 

auth service required = cephx 

auth client required = cephx 



#Default is 1GB, which is fine for us 

#osd journal size = {n} 



#Only needed if ext4 comes to play 

#filestore xattr use omap = true 



osd pool default size = 3 # Write an object n times. 

osd pool default min size = 2 # Allow writing n copy in a degraded state. 



#Set individual per pool by a formula 

#osd pool default pg num = {n} 

#osd pool default pgp num = {n} 

#osd crush chooseleaf type = {n} 





When I benchmark the cluster with “rbd bench-write rbd/fio” I get pretty good 
results: 

elapsed: 18 ops: 262144 ops/sec: 14466.30 bytes/sec: 59253946.11 



If I for example bench i.e. with fio with rbd engine, I get very poor results: 



[global] 

ioengine=rbd 

clientname=admin 

pool=rbd 

rbdname=fio 

invalidate=0 # mandatory 

rw=randwrite 

bs=512k 



[rbd_iodepth32] 

iodepth=32 



RESULTS: 

ite: io=2048.0MB, bw=53896KB/s, iops=105, runt= 38911msec 



Also if I mount the rbd with kernel as rbd0, format it with ext4 and then do a 
dd on it, its not that good: 

“dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc” 

RESULT: 

1073741824 Bytes (1,1 GB) kopiert, 12,6152 s, 85,1 MB/s 



I also tried presenting an rbd image with tgtd, mount it onto VMWare ESXi and 
test it in a vm, there I got only round about 50 iops with 4k, and writing 
sequentiell 25Mbytes. 

With NFS the read sequential values are good (400Mbyte/s) but writing only 
25Mbyte/s. 



What I tried tweaking so far: 



Intel NIC optimazitions: 

/etc/sysctl.conf 



# Increase system file descriptor limit 

fs.file-max = 65535 



# Increase system IP port range to allow for more concurrent connections 

net.ipv4.ip_local_port_range = 1024 65000 



# -- 10gbe tuning from Intel ixgb driver README -- # 



# turn off selective ACK and timestamps 

net.ipv4.tcp_sack = 0 

net.ipv4.tcp_timestamps = 0 



# memory allocation min/pressure/max. 

# read buffer, write buffer, and buffer space 

net.ipv4.tcp_rmem = 1000 1000 1000 

net.ipv4.tcp_wmem = 1000 1000 1000 

net.ipv4.tcp_mem = 1000 1000 1000 



net.core.rmem_max = 524287 

net.core.wmem_max = 5