[ceph-users] Fw: query about mapping of Swift/S3 APIs to Ceph cluster APIs

2015-03-16 Thread pragya jain
Please somebody answer my queries.

-Regards
Pragya Jain
Department of Computer Science
University of Delhi
Delhi, India

On Saturday, 14 March 2015 3:34 PM, pragya jain wrote:

 Hello all!
I have been working on the Ceph object storage architecture for the last few months.
I am unable to find a document that describes how the Ceph object storage
APIs (Swift/S3 APIs) are mapped to the Ceph storage cluster APIs (librados APIs)
in order to store the data in the Ceph storage cluster.
As the documents say: Radosgw, a gateway interface for Ceph object storage
users, accepts user requests to store or retrieve data in the form of Swift or
S3 API calls and converts them into RADOS requests.
Please help me in knowing:
1. How does Radosgw convert a user request into a RADOS request?
2. How are HTTP requests mapped to RADOS requests?

Thank you
-Regards
Pragya Jain
Department of Computer Science
University of Delhi
Delhi, India
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


 


[ceph-users] query about region and zone creation while configuring RADOSGW

2015-03-16 Thread pragya jain
hello all!
I am working on the Ceph object storage architecture. I have some queries:
When configuring a federated system, we need to create regions containing
one or more zones; the cluster must have a master region, and each region
must have a master zone.
But in the case of a simple gateway configuration, is there a need to create at
least one region and one zone to store the data?
Please, somebody reply to my query.
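
For reference, here is a hedged sketch of how one might check what a plain
(non-federated) gateway uses by default; the "default" region/zone and the
.rgw* pool names are assumptions based on a stock Firefly/Giant install:

  radosgw-admin region list
  radosgw-admin zone list
  radosgw-admin region get --rgw-region=default
  rados lspools | grep '^\.rgw'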
Thank you
-Regards
Pragya Jain
Department of Computer Science
University of Delhi
Delhi, India
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [SPAM] Changing pg_num => RBD VM down !

2015-03-16 Thread Alexandre DERUMIER
>>That full system slows down, OK, but brutal stop... 

This is strange; it could be:

- a qemu crash, maybe a bug in the rbd block storage (if you use librbd)
- the oom-killer on your host (any logs?)

What is your qemu version?
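
For example, something like this on the hypervisor would check both points
(only a sketch; log locations vary by distribution):

  # look for oom-killer activity around the time the VMs stopped
  dmesg | grep -iE 'oom|killed process'
  grep -iE 'oom-killer|killed process' /var/log/syslog /var/log/messages 2>/dev/null
  # report the qemu version in use
  qemu-system-x86_64 --version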


- Original Message -
From: "Florent Bautista" 
To: "ceph-users" 
Sent: Monday, 16 March 2015 10:11:43
Subject: Re: [ceph-users] [SPAM] Changing pg_num => RBD VM down !

Of course, but it does not explain why the VMs stopped... 
That the full system slows down is OK, but a brutal stop... 

On 03/14/2015 07:00 PM, Andrija Panic wrote: 



Changing the PG number causes a LOT of data rebalancing (in my case it was 80%), 
which I learned the hard way... 

On 14 March 2015 at 18:49, Gabri Mate < mailingl...@modernbiztonsag.org > 
wrote: 

I had the same issue a few days ago. I was increasing the pg_num of one 
pool from 512 to 1024 and all the VMs in that pool stopped. I came to 
the conclusion that doubling the pg_num caused such a high load in ceph 
that the VMs were blocked. The next time I will test with small 
increments. 


On 12:38 Sat 14 Mar , Florent B wrote: 
> Hi all, 
> 
> I have a Giant cluster in production. 
> 
> Today one of my RBD pools had the "too few pgs" warning. So I changed 
> pg_num & pgp_num. 
> 
> And at this moment, some of the VM stored on this pool were stopped (on 
> some hosts, not all, it depends, no logic) 
> 
> All was running fine for months... 
> 
> Have you ever seen this ? 
> What could have caused this ? 
> 
> Thank you. 
> 
> 
> ___ 
> ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 






-- 

Andrija Panić 




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [SPAM] Changing pg_num => RBD VM down !

2015-03-16 Thread Steffen W Sørensen

On 16/03/2015, at 11.14, Florent B  wrote:

> On 03/16/2015 11:03 AM, Alexandre DERUMIER wrote:
>> This is strange, that could be:
>> 
>> - qemu crash, maybe a bug in rbd block storage (if you use librbd)
>> - oom-killer on you host (any logs ?)
>> 
>> what is your qemu version ?
>> 
> 
> Now, we have version 2.1.3.
> 
> Some VMs that stopped were running for a long time, but some other had
> only 4 days uptime.
> 
> And I precise that not all VMs on that pool crashed, only some of them
> (a large majority), and on a same host, some crashed and others not.
> 
> We use Proxmox, so I think it uses librbd ?
I had the same issue once when bumping up pg_num: the majority of my Proxmox 
VMs stopped. I believe this might be due to heavy rebalancing causing timeouts 
when the VMs try to do I/O operations, thus generating kernel panics.

Next time around I want to go with smaller increments of pg_num and hopefully 
avoid this.
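
Something like the following stepwise loop is what I have in mind (only a
sketch; the pool name and step values are placeholders, and pgp_num has to
follow pg_num):

  for pg in 2176 2304 2432 2560; do
  ceph osd pool set rbd pg_num  $pg
  sleep 30   # give the new PGs time to be created
  ceph osd pool set rbd pgp_num $pg
  # wait for the cluster to settle before the next step
  until ceph health | grep -q HEALTH_OK; do sleep 60; done
  done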

I follow the need for more PGs when you have more OSDs, but how come the number 
of PGs becomes too few when adding more objects/data to a pool?

/Steffen


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [SPAM] Changing pg_num => RBD VM down !

2015-03-16 Thread Alexandre DERUMIER
>>We use Proxmox, so I think it uses librbd ? 

As I'm the one who made the Proxmox rbd plugin, I can confirm that yes, it's 
librbd ;)

Is the Ceph cluster on dedicated nodes, or are the VMs running on the same nodes 
as the OSD daemons?


>>And I precise that not all VMs on that pool crashed, only some of them 
>>(a large majority), and on a same host, some crashed and others not. 

Did the VM crash, as in no more qemu process?
Or is it the guest OS that crashed? (Do you use virtio, virtio-scsi or ide for 
your guests?)





- Original Message -
From: "Florent Bautista" 
To: "aderumier" 
Cc: "ceph-users" 
Sent: Monday, 16 March 2015 11:14:45
Subject: Re: [ceph-users] [SPAM] Changing pg_num => RBD VM down !

On 03/16/2015 11:03 AM, Alexandre DERUMIER wrote: 
> This is strange, that could be: 
> 
> - qemu crash, maybe a bug in rbd block storage (if you use librbd) 
> - oom-killer on you host (any logs ?) 
> 
> what is your qemu version ? 
> 

Now, we have version 2.1.3. 

Some VMs that stopped had been running for a long time, but some others had 
only 4 days of uptime. 

And I should point out that not all VMs on that pool crashed, only some of them 
(a large majority), and on the same host, some crashed and others did not. 

We use Proxmox, so I think it uses librbd? 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [SPAM] Changing pg_num => RBD VM down !

2015-03-16 Thread Alexandre DERUMIER
>>VMs are running on the same nodes than OSD

Are you sure that you didn't hit some kind of out-of-memory condition?
PG rebalancing can be memory hungry (depending on how many OSDs you have).

Do you see the oom-killer in your host logs?


- Original Message -
From: "Florent Bautista" 
To: "aderumier" 
Cc: "ceph-users" 
Sent: Monday, 16 March 2015 12:35:11
Subject: Re: [ceph-users] [SPAM] Changing pg_num => RBD VM down !

On 03/16/2015 12:23 PM, Alexandre DERUMIER wrote: 
>>> We use Proxmox, so I think it uses librbd ? 
> As It's me that I made the proxmox rbd plugin, I can confirm that yes, it's 
> librbd ;) 
> 
> Is the ceph cluster on dedicated nodes ? or vms are running on same nodes 
> than osd daemons ? 
> 

The VMs are running on the same nodes as the OSDs. 

>>> And I precise that not all VMs on that pool crashed, only some of them 
>>> (a large majority), and on a same host, some crashed and others not. 
> Is the vm crashed, like no more qemu process ? 
> or is it the guest os which is crashed ? (do you use virtio, virtio-scsi or 
> ide for your guest ?) 
> 
> 

I don't really know what crashed; I think the qemu process, but I'm not sure. 
We use virtio. 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [SPAM] Changing pg_num => RBD VM down !

2015-03-16 Thread Steffen W Sørensen

On 16/03/2015, at 12.23, Alexandre DERUMIER  wrote:

>>> We use Proxmox, so I think it uses librbd ? 
> 
> As It's me that I made the proxmox rbd plugin, I can confirm that yes, it's 
> librbd ;)
> Is the ceph cluster on dedicated nodes ? or vms are running on same nodes 
> than osd daemons ?
My cluster has Ceph OSDs+MONs on separate PVE nodes, no VMs.

> 
> 
>>> And I precise that not all VMs on that pool crashed, only some of them 
>>> (a large majority), and on a same host, some crashed and others not. 
> 
> Is the vm crashed, like no more qemu process ?
> or is it the guest os which is crashed ?
Hmm, it's a long time ago now; I remember the VM status was stopped, and resume 
didn't work, so they were started again ASAP :)

> (do you use virtio, virtio-scsi or ide for your guest ?)
virtio

/Steffen



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PHP Rados failed in read operation if object size is large (say more than 10 MB )

2015-03-16 Thread Gaurang Vyas
running on ubuntu with nginx + php-fpm


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [SPAM] Changing pg_num => RBD VM down !

2015-03-16 Thread Azad Aliyar
May I know your Ceph version? The latest version of Firefly, 0.80.9, has
patches to avoid excessive data migration during reweighting of OSDs. You may
need to set a tunable in order to make this patch active.

This is a bugfix release for firefly.  It fixes a performance regression
in librbd, an important CRUSH misbehavior (see below), and several RGW
bugs.  We have also backported support for flock/fcntl locks to ceph-fuse
and libcephfs.

We recommend that all Firefly users upgrade.

For more detailed information, see
  http://docs.ceph.com/docs/master/_downloads/v0.80.9.txt

Adjusting CRUSH maps


* This point release fixes several issues with CRUSH that trigger
  excessive data migration when adjusting OSD weights.  These are most
  obvious when a very small weight change (e.g., a change from 0 to
  .01) triggers a large amount of movement, but the same set of bugs
  can also lead to excessive (though less noticeable) movement in
  other cases.

  However, because the bug may already have affected your cluster,
  fixing it may trigger movement *back* to the more correct location.
  For this reason, you must manually opt-in to the fixed behavior.

  In order to set the new tunable to correct the behavior::

 ceph osd crush set-tunable straw_calc_version 1

  Note that this change will have no immediate effect.  However, from
  this point forward, any 'straw' bucket in your CRUSH map that is
  adjusted will get non-buggy internal weights, and that transition
  may trigger some rebalancing.

  You can estimate how much rebalancing will eventually be necessary
  on your cluster with::

 ceph osd getcrushmap -o /tmp/cm
 crushtool -i /tmp/cm --num-rep 3 --test --show-mappings > /tmp/a 2>&1
 crushtool -i /tmp/cm --set-straw-calc-version 1 -o /tmp/cm2
 crushtool -i /tmp/cm2 --reweight -o /tmp/cm2
 crushtool -i /tmp/cm2 --num-rep 3 --test --show-mappings > /tmp/b 2>&1
 wc -l /tmp/a  # num total mappings
 diff -u /tmp/a /tmp/b | grep -c ^+# num changed mappings

   Divide the total number of lines in /tmp/a with the number of lines
   changed.  We've found that most clusters are under 10%.

   You can force all of this rebalancing to happen at once with::

 ceph osd crush reweight-all

   Otherwise, it will happen at some unknown point in the future when
   CRUSH weights are next adjusted.

Notable Changes
---

* ceph-fuse: flock, fcntl lock support (Yan, Zheng, Greg Farnum)
* crush: fix straw bucket weight calculation, add straw_calc_version
  tunable (#10095 Sage Weil)
* crush: fix tree bucket (Rongzu Zhu)
* crush: fix underflow of tree weights (Loic Dachary, Sage Weil)
* crushtool: add --reweight (Sage Weil)
* librbd: complete pending operations before closing image (#10299 Jason
  Dillaman)
* librbd: fix read caching performance regression (#9854 Jason Dillaman)
* librbd: gracefully handle deleted/renamed pools (#10270 Jason Dillaman)
* mon: fix dump of chooseleaf_vary_r tunable (Sage Weil)
* osd: fix PG ref leak in snaptrimmer on peering (#10421 Kefu Chai)
* osd: handle no-op write with snapshot (#10262 Sage Weil)
* radosgw-admi




On 03/16/2015 12:37 PM, Alexandre DERUMIER wrote:
>>> VMs are running on the same nodes than OSD
> Are you sure that you didn't some kind of out of memory.
> pg rebalance can be memory hungry. (depend how many osd you have).

2 OSDs per host, and 5 hosts in this cluster.
hosts h
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PHP Rados failed in read operation if object size is large (say more than 10 MB )

2015-03-16 Thread Wido den Hollander
On 03/16/2015 01:55 PM, Gaurang Vyas wrote:
> running on ubuntu with nginx + php-fpm
> 
> <?php
> $rados = rados_create('admin');
> 
> 
> rados_conf_read_file($rados, '/etc/ceph/ceph.conf');
> rados_conf_set($rados, 'keyring','/etc/ceph/ceph.client.admin.keyring');
> 
> $temp = rados_conf_get($rados, "rados_osd_op_timeout");
> echo  "osd ";
> echo $temp;
> $temp = rados_conf_get($rados, "client_mount_timeout");
> echo  "client  " ;
> echo $temp;
> $temp = rados_conf_get($rados, "rados_mon_op_timeout");
> echo   "mon  " ;
> echo $temp;
> 
> $err = rados_connect($rados);
> $ioRados = rados_ioctx_create($rados,'dev_whereis');
> 
> $pieceSize = rados_stat($ioRados,'TEMP_object');
> var_dump($pieceSize);
> 
> $piece = rados_read($ioRados, 'TEMP_object',$pieceSize['psize'] ,0);
> 

So what is the error exactly? Are you running phprados from the master
branch on Github?

> echo $piece;
> ?>
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph release timeline

2015-03-16 Thread David Moreau Simard
Great work !

David Moreau Simard

On 2015-03-15 06:29 PM, Loic Dachary wrote:
> Hi Ceph,
>
> In an attempt to clarify what Ceph release is stable, LTS or development. a 
> new page was added to the documentation: 
> http://ceph.com/docs/master/releases/ It is a matrix where each cell is a 
> release number linked to the release notes from 
> http://ceph.com/docs/master/release-notes/. One line per month and one column 
> per release.
>
> Cheers
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [SPAM] Changing pg_num => RBD VM down !

2015-03-16 Thread Chu Duc Minh
I'm using the latest Giant and have the same issue. When I increase the pg_num
of a pool from 2048 to 2148, my VMs are still OK. When I increase from 2148
to 2400, some VMs die (the qemu-kvm process dies).
My physical servers (hosting the VMs) run kernel 3.13 and use librbd.
I think it's a bug in librbd related to the crushmap.
(I set crush_tunables3 on my Ceph cluster; does that make sense?)

Do you know a way to safely increase pg_num? (I don't think increasing pg_num
by 100 each time is a safe and good way.)

Regards,

On Mon, Mar 16, 2015 at 8:50 PM, Florent B  wrote:

> We are on Giant.
>
> On 03/16/2015 02:03 PM, Azad Aliyar wrote:
> >
> > May I know your ceph version.?. The latest version of firefly 80.9 has
> > patches to avoid excessive data migrations during rewighting osds. You
> > may need set a tunable inorder make this patch active.
> >
> > This is a bugfix release for firefly.  It fixes a performance regression
> > in librbd, an important CRUSH misbehavior (see below), and several RGW
> > bugs.  We have also backported support for flock/fcntl locks to ceph-fuse
> > and libcephfs.
> >
> > We recommend that all Firefly users upgrade.
> >
> > For more detailed information, see
> >   http://docs.ceph.com/docs/master/_downloads/v0.80.9.txt
> >
> > Adjusting CRUSH maps
> > 
> >
> > * This point release fixes several issues with CRUSH that trigger
> >   excessive data migration when adjusting OSD weights.  These are most
> >   obvious when a very small weight change (e.g., a change from 0 to
> >   .01) triggers a large amount of movement, but the same set of bugs
> >   can also lead to excessive (though less noticeable) movement in
> >   other cases.
> >
> >   However, because the bug may already have affected your cluster,
> >   fixing it may trigger movement *back* to the more correct location.
> >   For this reason, you must manually opt-in to the fixed behavior.
> >
> >   In order to set the new tunable to correct the behavior::
> >
> >  ceph osd crush set-tunable straw_calc_version 1
> >
> >   Note that this change will have no immediate effect.  However, from
> >   this point forward, any 'straw' bucket in your CRUSH map that is
> >   adjusted will get non-buggy internal weights, and that transition
> >   may trigger some rebalancing.
> >
> >   You can estimate how much rebalancing will eventually be necessary
> >   on your cluster with::
> >
> >  ceph osd getcrushmap -o /tmp/cm
> >  crushtool -i /tmp/cm --num-rep 3 --test --show-mappings > /tmp/a
> 2>&1
> >  crushtool -i /tmp/cm --set-straw-calc-version 1 -o /tmp/cm2
> >  crushtool -i /tmp/cm2 --reweight -o /tmp/cm2
> >  crushtool -i /tmp/cm2 --num-rep 3 --test --show-mappings > /tmp/b
> > 2>&1
> >  wc -l /tmp/a  # num total mappings
> >  diff -u /tmp/a /tmp/b | grep -c ^+# num changed mappings
> >
> >Divide the total number of lines in /tmp/a with the number of lines
> >changed.  We've found that most clusters are under 10%.
> >
> >You can force all of this rebalancing to happen at once with::
> >
> >  ceph osd crush reweight-all
> >
> >Otherwise, it will happen at some unknown point in the future when
> >CRUSH weights are next adjusted.
> >
> > Notable Changes
> > ---
> >
> > * ceph-fuse: flock, fcntl lock support (Yan, Zheng, Greg Farnum)
> > * crush: fix straw bucket weight calculation, add straw_calc_version
> >   tunable (#10095 Sage Weil)
> > * crush: fix tree bucket (Rongzu Zhu)
> > * crush: fix underflow of tree weights (Loic Dachary, Sage Weil)
> > * crushtool: add --reweight (Sage Weil)
> > * librbd: complete pending operations before losing image (#10299 Jason
> >   Dillaman)
> > * librbd: fix read caching performance regression (#9854 Jason Dillaman)
> > * librbd: gracefully handle deleted/renamed pools (#10270 Jason Dillaman)
> > * mon: fix dump of chooseleaf_vary_r tunable (Sage Weil)
> > * osd: fix PG ref leak in snaptrimmer on peering (#10421 Kefu Chai)
> > * osd: handle no-op write with snapshot (#10262 Sage Weil)
> > * radosgw-admi
> >
> >
> >
> >
> > On 03/16/2015 12:37 PM, Alexandre DERUMIER wrote:
> > >>> VMs are running on the same nodes than OSD
> > > Are you sure that you didn't some kind of out of memory.
> > > pg rebalance can be memory hungry. (depend how many osd you have).
> >
> > 2 OSD per host, and 5 hosts in this cluster.
> > hosts h
> >
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [SPAM] Changing pg_num => RBD VM down !

2015-03-16 Thread Michael Kuriger
I always keep my pg number a power of 2.  So I’d go from 2048 to 4096.  I’m not 
sure if this is the safest way, but it’s worked for me.





Michael Kuriger
Sr. Unix Systems Engineer
• mk7...@yp.com |• 818-649-7235


From: Chu Duc Minh <chu.ducm...@gmail.com>
Date: Monday, March 16, 2015 at 7:49 AM
To: Florent B <flor...@coppint.com>
Cc: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] [SPAM] Changing pg_num => RBD VM down !
Subject: Re: [ceph-users] [SPAM] Changing pg_num => RBD VM down !

I'm using the latest Giant and have the same issue. When I increase the pg_num of a 
pool from 2048 to 2148, my VMs are still OK. When I increase from 2148 to 2400, 
some VMs die (the qemu-kvm process dies).
My physical servers (hosting the VMs) run kernel 3.13 and use librbd.
I think it's a bug in librbd related to the crushmap.
(I set crush_tunables3 on my Ceph cluster; does that make sense?)

Do you know a way to safely increase pg_num? (I don't think increasing pg_num by 100 
each time is a safe and good way.)

Regards,

On Mon, Mar 16, 2015 at 8:50 PM, Florent B <flor...@coppint.com> wrote:
We are on Giant.

On 03/16/2015 02:03 PM, Azad Aliyar wrote:
>
> May I know your ceph version.?. The latest version of firefly 80.9 has
> patches to avoid excessive data migrations during rewighting osds. You
> may need set a tunable inorder make this patch active.
>
> This is a bugfix release for firefly.  It fixes a performance regression
> in librbd, an important CRUSH misbehavior (see below), and several RGW
> bugs.  We have also backported support for flock/fcntl locks to ceph-fuse
> and libcephfs.
>
> We recommend that all Firefly users upgrade.
>
> For more detailed information, see
>   
> http://docs.ceph.com/docs/master/_downloads/v0.80.9.txt
>
> Adjusting CRUSH maps
> 
>
> * This point release fixes several issues with CRUSH that trigger
>   excessive data migration when adjusting OSD weights.  These are most
>   obvious when a very small weight change (e.g., a change from 0 to
>   .01) triggers a large amount of movement, but the same set of bugs
>   can also lead to excessive (though less noticeable) movement in
>   other cases.
>
>   However, because the bug may already have affected your cluster,
>   fixing it may trigger movement *back* to the more correct location.
>   For this reason, you must manually opt-in to the fixed behavior.
>
>   In order to set the new tunable to correct the behavior::
>
>  ceph osd crush set-tunable straw_calc_version 1
>
>   Note that this change will have no immediate effect.  However, from
>   this point forward, any 'straw' bucket in your CRUSH map that is
>   adjusted will get non-buggy internal weights, and that transition
>   may trigger some rebalancing.
>
>   You can estimate how much rebalancing will eventually be necessary
>   on your cluster with::
>
>  ceph osd getcrushmap -o /tmp/cm
>  crushtool -i /tmp/cm --num-rep 3 --test --show-mappings > /tmp/a 2>&1
>  crushtool -i /tmp/cm --set-straw-calc-version 1 -o /tmp/cm2
>  crushtool -i /tmp/cm2 --reweight -o /tmp/cm2
>  crushtool -i /tmp/cm2 --num-rep 3 --test --show-mappings > /tmp/b
> 2>&1
>  wc -l /tmp/a  # num total mappings
>  diff -u /tmp/a /tmp/b | grep -c ^+# num changed mappings
>
>Divide the total number of lines in /tmp/a with the number of lines
>changed.  We've found that most clusters are under 10%.
>
>You can force all of this rebalancing to happen at once with::
>
>  ceph osd crush reweight-all
>
>Otherwise, it will happen at some unknown point in the future when
>CRUSH weights are next adjusted.
>
> Notable Changes
> ---
>
> * ceph-fuse: flock, fcntl lock support (Yan, Zheng, Greg Farnum)
> * crush: fix straw bucket weight calculation, add straw_calc_version
>   tunable (#10095 Sage Weil)
> * crush: fix tree bucket (Rongzu Zhu)
> * crush: fix underflow of tree weights (Loic Dachary, Sage Weil)
> * crushtool: add --reweight (Sage Weil)
> * librbd: complete pending operations before losing image (#10299 Jason
>   Dillaman)
> * librbd: fix read caching performance regression (#9854 Jason Dillaman)
> * librbd: gracefully handle deleted/renamed pools (#10270 Jason Dillaman)
> * mon: fix dump of chooseleaf_vary_r tunable (Sage Weil)
> * osd: fix PG ref leak in snaptrimmer on peering (#10421 Kefu Chai)
> * osd: handle no-op write with snapshot (#10262 Sage Weil)
> * radosgw-admi
>
>
>
>
> On 03/16/2015 12:37 PM, Alexandre DERUMIER wrote:
> >>> VMs are running on the same nodes than OSD
> > Are you sure that you didn't some kind of out of memory.
> > pg rebalance can be memory hungry. (depend how many o

Re: [ceph-users] [SPAM] Changing pg_num => RBD VM down !

2015-03-16 Thread Chu Duc Minh
@Michael Kuriger: when Ceph/librbd operates normally, I know that doubling the
pg_num is the safe way. But when it has a problem, I think doubling it can make
many, many VMs die (maybe >= 50%?).


On Mon, Mar 16, 2015 at 9:53 PM, Michael Kuriger  wrote:

>   I always keep my pg number a power of 2.  So I’d go from 2048 to 4096.
> I’m not sure if this is the safest way, but it’s worked for me.
>
>
>
> [image: yp]
>
>
>
> Michael Kuriger
>
> Sr. Unix Systems Engineer
>
> * mk7...@yp.com |( 818-649-7235
>
>   From: Chu Duc Minh 
> Date: Monday, March 16, 2015 at 7:49 AM
> To: Florent B 
> Cc: "ceph-users@lists.ceph.com" 
> Subject: Re: [ceph-users] [SPAM] Changing pg_num => RBD VM down !
>
>I'm using the latest Giant and have the same issue. When i increase
> PG_num of a pool from 2048 to 2148, my VMs is still ok. When i increase
> from 2148 to 2400, some VMs die (Qemu-kvm process die).
>  My physical servers (host VMs) running kernel 3.13 and use librbd.
>  I think it's a bug in librbd with crushmap.
>  (I set crush_tunables3 on my ceph cluster, does it make sense?)
>
> Do you know a way to safely increase PG_num? (I don't think increase
> PG_num 100 each times is a safe & good way)
>
>  Regards,
>
> On Mon, Mar 16, 2015 at 8:50 PM, Florent B  wrote:
>
>> We are on Giant.
>>
>> On 03/16/2015 02:03 PM, Azad Aliyar wrote:
>> >
>> > May I know your ceph version.?. The latest version of firefly 80.9 has
>> > patches to avoid excessive data migrations during rewighting osds. You
>> > may need set a tunable inorder make this patch active.
>> >
>> > This is a bugfix release for firefly.  It fixes a performance regression
>> > in librbd, an important CRUSH misbehavior (see below), and several RGW
>> > bugs.  We have also backported support for flock/fcntl locks to
>> ceph-fuse
>> > and libcephfs.
>> >
>> > We recommend that all Firefly users upgrade.
>> >
>> > For more detailed information, see
>> >   http://docs.ceph.com/docs/master/_downloads/v0.80.9.txt
>> 
>> >
>> > Adjusting CRUSH maps
>> > 
>> >
>> > * This point release fixes several issues with CRUSH that trigger
>> >   excessive data migration when adjusting OSD weights.  These are most
>> >   obvious when a very small weight change (e.g., a change from 0 to
>> >   .01) triggers a large amount of movement, but the same set of bugs
>> >   can also lead to excessive (though less noticeable) movement in
>> >   other cases.
>> >
>> >   However, because the bug may already have affected your cluster,
>> >   fixing it may trigger movement *back* to the more correct location.
>> >   For this reason, you must manually opt-in to the fixed behavior.
>> >
>> >   In order to set the new tunable to correct the behavior::
>> >
>> >  ceph osd crush set-tunable straw_calc_version 1
>> >
>> >   Note that this change will have no immediate effect.  However, from
>> >   this point forward, any 'straw' bucket in your CRUSH map that is
>> >   adjusted will get non-buggy internal weights, and that transition
>> >   may trigger some rebalancing.
>> >
>> >   You can estimate how much rebalancing will eventually be necessary
>> >   on your cluster with::
>> >
>> >  ceph osd getcrushmap -o /tmp/cm
>> >  crushtool -i /tmp/cm --num-rep 3 --test --show-mappings > /tmp/a
>> 2>&1
>> >  crushtool -i /tmp/cm --set-straw-calc-version 1 -o /tmp/cm2
>> >  crushtool -i /tmp/cm2 --reweight -o /tmp/cm2
>> >  crushtool -i /tmp/cm2 --num-rep 3 --test --show-mappings > /tmp/b
>> > 2>&1
>> >  wc -l /tmp/a  # num total mappings
>> >  diff -u /tmp/a /tmp/b | grep -c ^+# num changed mappings
>> >
>> >Divide the total number of lines in /tmp/a with the number of lines
>> >changed.  We've found that most clusters are under 10%.
>> >
>> >You can force all of this rebalancing to happen at once with::
>> >
>> >  ceph osd crush reweight-all
>> >
>> >Otherwise, it will happen at some unknown point in the future when
>> >CRUSH weights are next adjusted.
>> >
>> > Notable Changes
>> > ---
>> >
>> > * ceph-fuse: flock, fcntl lock support (Yan, Zheng, Greg Farnum)
>> > * crush: fix straw bucket weight calculation, add straw_calc_version
>> >   tunable (#10095 Sage Weil)
>> > * crush: fix tree bucket (Rongzu Zhu)
>> > * crush: fix underflow of tree weights (Loic Dachary, Sage Weil)
>> > * crushtool: add --reweight (Sage Weil)
>> > * librbd: complete pending operations before losing image (#10299 Jason
>> >   Dillaman)
>> > * librbd: fix read caching performance regression (#9854 Jason Dillaman)
>> > * librbd: gracefully handle deleted/renamed pools (#10270 Jason
>> Dillaman)
>> > * mon: fix dump of chooseleaf_vary_r tunable (Sage W

Re: [ceph-users] Calamari - Data

2015-03-16 Thread John Spray

Sumit,

You may have better luck on the ceph-calamari mailing list.  Anyway - 
calamari uses graphite to handle metrics, and graphite does indeed write 
them to files.
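
If you want to see them on disk, something like this should list the whisper
databases (the path is the usual graphite default on Debian/Ubuntu, so treat it
as an assumption):

  find /var/lib/graphite/whisper -name '*.wsp' | head
  du -sh /var/lib/graphite/whisper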


John

On 11/03/2015 05:09, Sumit Gaur wrote:

Hi
I have a basic architecture-related question. I know Calamari collects 
system usage data (via the Diamond collector) using performance counters. I 
need to know whether all the system performance data that Calamari shows 
remains in memory or whether it uses files to store it.

Thanks
sumit




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: delayed objects deletion ?

2015-03-16 Thread John Spray

On 14/03/2015 09:22, Florent B wrote:

Hi,

What do you call "old MDS" ? I'm on Giant release, it is not very old...
With CephFS we have a special definition of "old" that is anything that 
doesn't have the very latest bug fixes ;-)


There have definitely been fixes to stray file handling[1] between giant 
and hammer.  Since with giant you're using a version that is neither 
latest nor LTS, I'd suggest you upgrade to hammer.  Hammer also includes 
some new perf counters related to strays[2] that will allow you to see 
how the purging is (or isn't) progressing.
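
For example, once on hammer something along these lines should show them (a
sketch; the exact counter names are an assumption):

  ceph daemon mds.<id> perf dump | grep -i stray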


If you can reproduce this on hammer, then please capture "ceph daemon 
mds.<id> session ls" and "ceph mds tell mds.<id> dumpcache 
/tmp/cache.txt", in addition to the procedure to reproduce.  Ideally 
logs with "debug mds = 10" as well.


Cheers,
John

1.
http://tracker.ceph.com/issues/10387
http://tracker.ceph.com/issues/10164

2.
http://tracker.ceph.com/issues/10388


And I tried restarting both but it didn't solve my problem.

Will it be OK in Hammer ?

On 03/13/2015 04:27 AM, Yan, Zheng wrote:

On Fri, Mar 13, 2015 at 1:17 AM, Florent B  wrote:

Hi all,

I test CephFS again on Giant release.

I use ceph-fuse.

After deleting a large directory (a few hours ago), I can see that my pool
still contains 217 GB of objects.

Even if my root directory on CephFS is empty.

And metadata pool is 46 MB.

Is it expected ? If not, how to debug this ?

Old MDS versions do not work well in this area. Try unmounting the clients and
restarting the MDS.

Regards
Yan, Zheng



Thank you.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: authorizations ?

2015-03-16 Thread John Spray

On 13/03/2015 11:51, Florent B wrote:

Hi all,

My question is about user management in CephFS.

Is it possible to restrict a CephX user to access some subdirectories ?
Not yet.  The syntax for setting a "path=" part in the authorization 
caps for a cephx user exists, but the code for enforcing it isn't done yet.


John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: delayed objects deletion ?

2015-03-16 Thread Florent B
On 03/16/2015 05:14 PM, John Spray wrote:
> With CephFS we have a special definition of "old" that is anything
> that doesn't have the very latest bug fixes ;-)
>
> There have definitely been fixes to stray file handling[1] between
> giant and hammer.  Since with giant you're using a version that is
> neither latest nor LTS, I'd suggest you upgrade to hammer.  Hammer
> also includes some new perf counters related to strays[2] that will
> allow you to see how the purging is (or isn't) progressing.
>
> If you can reproduce this on hammer, then please capture "ceph daemon
> mds. session ls" and "ceph mds tell mds.
> dumpcache /tmp/cache.txt", in addition to the procedure to reproduce. 
> Ideally logs with "debug mds = 10" as well.
>
> Cheers,
> John
>
> 1.
> http://tracker.ceph.com/issues/10387
> http://tracker.ceph.com/issues/10164
>
> 2.
> http://tracker.ceph.com/issues/10388

Thank you John :)

Hammer is not released yet, is it ?
Is it 'safe' to upgrade a production cluster to 0.93 ?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: delayed objects deletion ?

2015-03-16 Thread John Spray

On 16/03/2015 16:30, Florent B wrote:
Thank you John :) Hammer is not released yet, is it ? Is it 'safe' to 
upgrade a production cluster to 0.93 ? 
I keep forgetting that -- yes, I should have added "...when it's 
released" :-)


John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rados duplicate object name

2015-03-16 Thread Gregory Farnum
This is expected behavior - "put" uses write_full which is an object
overwrite command.
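
A quick way to see this (just a sketch with a throwaway pool and object):

  echo one > file1; echo twotwo > file2
  rados -p testpool put myobject file1
  rados -p testpool stat myobject   # note the size/mtime
  rados -p testpool put myobject file2
  rados -p testpool stat myobject   # size/mtime changed, no error raised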
On Thu, Mar 12, 2015 at 4:17 PM Kapil Sharma  wrote:

> Hi Cephers,
>
> Has anyone tested the behavior of rados by adding an object to the
> cluster with an object name which already exists in the cluster ?
> with command - "rados put -p testpool myobject testfile"
>
> I notice that even if I already have an object called 'myobject' in
> testpool,
> I can still add a new object with same name and it overwrites my previous
> object without any error message.
>
> With RBD this is not an issue. I do see a proper error message when
> I try to add an RBD with name which already exists -
> rbd: create error: (17) File exists2015-03-13 00:16:09.800355 7fe2c4c47780
> -1 librbd: rbd image foo already exists
>
>
>
> Regards,
> Kapil.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OS file Cache, Ceph RBD cache and Network files systems

2015-03-16 Thread Stéphane DUGRAVOT
Hi Cephers, 

Our university is considering deploying Ceph. The goal is to store data for 
research laboratories (non-HPC). To do this, we plan to use Ceph with RBD (a 
mapped block device) on an NFS (or CIFS) server (the Ceph client), which then 
serves workstations in the laboratories. According to our tests, the OS (Ubuntu, 
CentOS, ...) that maps the RBD block device applies its file system write cache 
(vm.dirty_ratio, etc.). In that case, the NFS server will acknowledge writes from 
the workstations before it has finished writing the data to the Ceph cluster, and 
this regardless of whether the RBD cache is enabled or not in the config 
[client] section. 

My questions: 


1. Is enabling the RBD cache useful only when combined with virtual machines 
(where QEMU can access an image as a virtual block device directly via librbd)? 
2. Is it common to use Ceph with RBD to share network file systems? 
3. And if so, what are the recommendations concerning the OS cache? (The 
relevant knobs are sketched below.) 
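
For reference, a sketch of where those knobs live (the values shown are just 
examples, not recommendations):

  # RBD client-side caching is configured in the [client] section of ceph.conf:
  #   [client]
  #   rbd cache = true
  #   rbd cache size = 33554432
  # The page-cache behaviour of the NFS/RBD server is governed by the vm.dirty_*
  # sysctls:
  sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_expire_centisecs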

Thanks a lot. 
Stephane. 

-- 
Université de Lorraine 
Stéphane DUGRAVOT - Direction du numérique - Infrastructure 
Jabber : stephane.dugra...@univ-lorraine.fr 
Tél.: +33 3 83 68 20 98 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: stripe_unit=65536 + object_size=1310720 => pipe.fault, server, going to standby

2015-03-16 Thread John Spray

On 11/03/2015 08:59, Florent B wrote:

Hi all,

I'm testing CephFS with Giant and I have a problem when I set these 
attrs :


setfattr -n ceph.dir.layout.stripe_unit -v "65536" pool_cephfs01/
setfattr -n ceph.dir.layout.stripe_count -v "1" pool_cephfs01/
setfattr -n ceph.dir.layout.object_size -v "1310720" pool_cephfs01/
setfattr -n ceph.dir.layout.pool -v "cephfs01" pool_cephfs01/

When a client writes files in pool_cephfs01/, It got "failed: 
Transport endpoint is not connected (107)" and these errors on MDS :


10.111.0.6:6801/41706 >> 10.111.17.118:0/9384 pipe(0x5e3a580 sd=27 
:6801 s=2 pgs=2 cs=1 l=0 c=0x6a8d1e0).fault, server, going to standby
Thanks for the bug report.  I can reproduce a fuse client crash using 
these layout settings on master: http://tracker.ceph.com/issues/11120




When I set stripe_unit=1048576 & object_size=1048576, it seems to work.

What are the "rules" for stripe_unit & object_size ?
In addition to Ilya's explanation in this thread, these references may 
be useful for you (CephFS striping is the same as RBD striping):

- http://docs.ceph.com/docs/master/architecture/#data-striping
- http://ceph.com/docs/master/man/8/rbd/#striping

John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mapping users to different rgw pools

2015-03-16 Thread Craig Lewis
Yes, the placement target feature is logically separate from multi-zone
setups.  Placement targets are configured in the region though, which
somewhat muddies the issue.

Placement targets are a useful feature for multi-zone setups, so different zones in
a cluster don't share the same disks.  The federation setup guide is the only place
I've seen any discussion of the topic.  Even that is just a brief
mention.  I didn't see any documentation directly talking about setting up
placement targets, even in the federation guides.

It looks like you'll need to edit the default region to add the placement
targets, but you won't need to set up zones.  As far as I can tell, you'll
have to piece together what you need from the federation setup and some
experimentation.  I highly recommend a test VM that you can experiment on
before attempting anything in production.
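
As a starting point, the workflow is roughly this (a hedged sketch; the JSON key
names are from memory, so verify them against your own region/zone dumps first):

  radosgw-admin region get > region.json   # add an entry under "placement_targets"
  radosgw-admin zone get > zone.json       # add a matching "placement_pools" entry
  # ... edit both files, pointing the new target at pools created with the
  # desired CRUSH rule / replication or erasure-code profile ...
  radosgw-admin region set --infile region.json
  radosgw-admin zone set --infile zone.json
  radosgw-admin regionmap update
  # then give a user that target by default via its metadata ("default_placement"):
  radosgw-admin metadata get user:<uid> > user.json   # edit, then:
  radosgw-admin metadata put user:<uid> < user.json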




On Sun, Mar 15, 2015 at 11:53 PM, Sreenath BH  wrote:

> Thanks.
>
> Is this possible outside of multi-zone setup. (With only one Zone)?
>
> For example, I want to have pools with different replication
> factors(or erasure codings) and map users to these pools.
>
> -Sreenath
>
>
> On 3/13/15, Craig Lewis  wrote:
> > Yes, RadosGW has the concept of Placement Targets and Placement Pools.
> You
> > can create a target, and point it a set of RADOS pools.  Those pools can
> be
> > configured to use different storage strategies by creating different
> > crushmap rules, and assigning those rules to the pool.
> >
> > RGW users can be assigned a default placement target.  When they create a
> > bucket, they can either specify the target, or use their default one.
> All
> > objects in a bucket are stored according to the bucket's placement
> target.
> >
> >
> > I haven't seen a good guide for making use of these features.  The best
> > guide I know of is the Federation guide (
> > http://ceph.com/docs/giant/radosgw/federated-config/), but it only
> briefly
> > mentions placement targets.
> >
> >
> >
> > On Thu, Mar 12, 2015 at 11:48 PM, Sreenath BH 
> wrote:
> >
> >> Hi all,
> >>
> >> Can one Radow gateway support more than one pool for storing objects?
> >>
> >> And as a follow-up question, is there a way to map different users to
> >> separate rgw pools so that their obejcts get stored in different
> >> pools?
> >>
> >> thanks,
> >> Sreenath
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd laggy algorithm

2015-03-16 Thread Gregory Farnum
On Wed, Mar 11, 2015 at 8:40 AM, Artem Savinov  wrote:
> hello.
> ceph transfers osd node in the down status by default , after receiving 3
> reports about disabled nodes. Reports are sent per   "osd heartbeat grace"
> seconds, but with the settings "mon_osd_adjust_heartbeat_grace = true,
> mon_osd_adjust_down_out_interval = true" the timeout to transfer nodes into down
> status may vary. Tell me please: what algorithm enables changes timeout for
> the transfer nodes occur in down/out status and which parameters are
> affected?
> thanks.

The monitors keep track of which detected failures are incorrect
(based on reports from the marked-down/out OSDs) and build up an
expectation about how often the failures are correct based on an
exponential backoff of the data points. You can look at the code in
OSDMonitor.cc if you're interested, but basically they apply that
expectation to modify the down interval and the down-out interval to a
value large enough that they believe the OSD is really down (assuming
these config options are set). It's not terribly interesting. :)
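
If you want to poke at it, the relevant options and (if present in your version)
the per-OSD laggy estimates can be inspected roughly like this (a sketch):

  # current values on a running monitor, via its admin socket
  ceph daemon mon.<id> config show | grep -E 'mon_osd_adjust|osd_heartbeat_grace|mon_osd_down_out_interval'
  # laggy_probability / laggy_interval live in the OSDMap's osd_xinfo (field
  # names are an assumption) and show up in the JSON dump:
  ceph osd dump --format=json-pretty | grep -i laggy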
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

2015-03-16 Thread Gregory Farnum
On Wed, Mar 11, 2015 at 2:25 PM, Nick Fisk  wrote:
>
> I’m not sure if it’s something I’m doing wrong or just experiencing an 
> oddity, but when my cache tier flushes dirty blocks out to the base tier, the 
> writes seem to hit the OSD’s straight away instead of coalescing in the 
> journals, is this correct?
>
> For example if I create a RBD on a standard 3 way replica pool and run fio 
> via librbd 128k writes, I see the journals take all the io’s until I hit my 
> filestore_min_sync_interval and then I see it start writing to the underlying 
> disks.
>
> Doing the same on a full cache tier (to force flushing)  I immediately see 
> the base disks at a very high utilisation. The journals also have some write 
> IO at the same time. The only other odd thing I can see via iostat is that 
> most of the time whilst I’m running Fio, is that I can see the underlying 
> disks doing very small write IO’s of around 16kb with an occasional big burst 
> of activity.
>
> I know erasure coding+cache tier is slower than just plain replicated pools, 
> but even with various high queue depths I’m struggling to get much above 
> 100-150 iops compared to a 3 way replica pool which can easily achieve 
> 1000-1500. The base tier is comprised of 40 disks. It seems quite a marked 
> difference and I’m wondering if this strange journal behaviour is the cause.
>
> Does anyone have any ideas?

If you're running a full cache pool, then on every operation touching
an object which isn't in the cache pool it will try and evict an
object. That's probably what you're seeing.

Cache pool in general are only a wise idea if you have a very skewed
distribution of data "hotness" and the entire hot zone can fit in
cache at once.
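
If you do stay with the cache tier, it usually helps to make it flush/evict
before it is completely full rather than on demand, e.g. (a sketch; the pool
name and values are placeholders):

  ceph osd pool set hot-pool target_max_bytes 500000000000
  ceph osd pool set hot-pool cache_target_dirty_ratio 0.4
  ceph osd pool set hot-pool cache_target_full_ratio 0.8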
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs stuck unclean "active+remapped" after an osd marked out

2015-03-16 Thread Gregory Farnum
On Wed, Mar 11, 2015 at 3:49 PM, Francois Lafont  wrote:
> Hi,
>
> I was always in the same situation: I couldn't remove an OSD without
> have some PGs definitely stuck to the "active+remapped" state.
>
> But I remembered I read on IRC that, before to mark out an OSD, it
> could be sometimes a good idea to reweight it to 0. So, instead of
> doing [1]:
>
> ceph osd out 3
>
> I have tried [2]:
>
> ceph osd crush reweight osd.3 0 # waiting for the rebalancing...
> ceph osd out 3
>
> and it worked. Then I could remove my osd with the online documentation:
> http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual
>
> Now, the osd is removed and my cluster is HEALTH_OK. \o/
>
> Now, my question is: why my cluster was definitely stuck to "active+remapped"
> with [1] but was not with [2]? Personally, I have absolutely no explanation.
> If you have an explanation, I'd love to know it.

If I remember/guess correctly, if you mark an OSD out it won't
necessarily change the weight of the bucket above it (ie, the host),
whereas if you change the weight of the OSD then the host bucket's
weight changes. That makes for different mappings, and since you only
have a couple of OSDs per host (normally: hurray!) and not many hosts
(normally: sadness) then marking one OSD out makes things harder for
the CRUSH algorithm.
-Greg

>
> Should the "reweight" command be present in the online documentation?
> http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual
> If yes, I can make a pull request on the doc with pleasure. ;)
>
> Regards.
>
> --
> François Lafont
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] client-ceph [can not connect from client][connect protocol feature mismatch]

2015-03-16 Thread Sonal Dubey
Thanks a lot Stephane and Kamil,

Your reply was really helpful. I needed a different version of the Ceph client
on my client machine. Initially my Java application using librados was
throwing a connection timeout. Then I tried querying Ceph from the command
line (ceph --id ...), which was giving the error:



2015-03-05 13:37:16.816322 7f5191deb700 -- 10.8.25.112:0/2487 >>
10.138.23.241:6789/0pipe(0x12489f0 sd=3 pgs=0 cs=0 l=0).connect protocol
feature mismatch, my 1ffa < peer 42041ffa missing 4204


From the hints given in your mail I tried:

wget -q -O- 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc' | sudo apt-key add -
wget -q -O- 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/autobuild.asc' | sudo apt-key add -
echo deb http://ceph.com/packages/ceph-extras/debian $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph-extras.list
echo deb http://ceph.com/debian-firefly/ $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph.list
sudo apt-get install ceph-common

to verify:
ceph --id brts --keyring=/etc/ceph/ceph.client.brts.keyring health
HEALTH_OK

Thanks for the reply.

-Sonal


On Fri, Mar 6, 2015 at 5:50 AM, Stéphane DUGRAVOT <
stephane.dugra...@univ-lorraine.fr> wrote:

> Hi Sonal,
> You can refer to this doc to identify your problem.
> Your error code is 4204, so
>
>- 4000 upgrade to kernel 3.9
>-  200 CEPH_FEATURE_CRUSH_TUNABLES2
>- 4 CEPH_FEATURE_CRUSH_TUNABLES
>
>
>-
>http://ceph.com/planet/feature-set-mismatch-error-on-ceph-kernel-client/
>
> Stephane.
>
> --
>
> Hi,
>
> I am newbie for ceph, and ceph-user group. Recently I have been working on
> a ceph client. It worked on all the environments while when i tested on the
> production, it is not able to connect to ceph.
>
> Following are the operating system details and error. If someone has seen
> this problem before, any help is really appreciated.
>
> OS -
>
> lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description: Ubuntu 12.04.2 LTS
> Release: 12.04
> Codename: precise
>
> 2015-03-05 13:37:16.816322 7f5191deb700 -- 10.8.25.112:0/2487 >>
> 10.138.23.241:6789/0 pipe(0x12489f0 sd=3 pgs=0 cs=0 l=0).connect protocol
> feature mismatch, my 1ffa < peer 42041ffa missing 4204
> 2015-03-05 13:37:17.635776 7f5191deb700 -- 10.8.25.112:0/2487 >>
> 10.138.23.241:6789/0 pipe(0x12489f0 sd=3 pgs=0 cs=0 l=0).connect protocol
> feature mismatch, my 1ffa < peer 42041ffa missing 4204
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RadosGW Direct Upload Limitation

2015-03-16 Thread Georgios Dimitrakakis

Hi all!

I have recently updated to CEPH version 0.80.9 (latest Firefly release) 
which presumably

supports direct upload.

I've tried to upload a file using this functionality and it seems to be working
for files up to 5GB. For files above 5GB there is an error. I believe that this
is because of a hardcoded limit:

#define RGW_MAX_PUT_SIZE (5ULL*1024*1024*1024)


Is there a way to increase that limit other than compiling CEPH from 
source?


Could we somehow put it as a configuration parameter?


Looking forward to hear from you!


Regards,


George
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs stuck unclean "active+remapped" after an osd marked out

2015-03-16 Thread Francois Lafont
Hi,

Gregory Farnum wrote:

> If I remember/guess correctly, if you mark an OSD out it won't
> necessarily change the weight of the bucket above it (ie, the host),
> whereas if you change the weight of the OSD then the host bucket's
> weight changes.

I can just say that, indeed, I have noticed exactly what you describe
in the output of "ceph osd tree".

> That makes for different mappings, and since you only
> have a couple of OSDs per host (normally: hurray!)

Er, no: I have 10 OSDs in the first "osd node" and 11 OSDs in the
second "osd node" (see my first message).

> and not many hosts (normally: sadness)

Yes, I have only 2 "osd nodes" (and 3 monitors).

> then marking one OSD out makes things harder for the CRUSH algorithm.

Ah, OK. So my cluster is too small for Ceph. ;)
Thanks for your answer Greg, I will follow the pull request closely.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RadosGW Direct Upload Limitation

2015-03-16 Thread Gregory Farnum
On Mon, Mar 16, 2015 at 11:14 AM, Georgios Dimitrakakis
 wrote:
> Hi all!
>
> I have recently updated to CEPH version 0.80.9 (latest Firefly release)
> which presumably
> supports direct upload.
>
> I 've tried to upload a file using this functionality and it seems that is
> working
> for files up to 5GB. For files above 5GB there is an error. I believe that
> this is because
> of a hardcoded limit:
>
> #define RGW_MAX_PUT_SIZE(5ULL*1024*1024*1024)
>
>
> Is there a way to increase that limit other than compiling CEPH from source?

No.

>
> Could we somehow put it as a configuration parameter?

Maybe, but I'm not sure if Yehuda would want to take it upstream or
not. This limit is present because it's part of the S3 spec. For
larger objects you should use multi-part upload, which can get much
bigger.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs stuck unclean "active+remapped" after an osd marked out

2015-03-16 Thread Craig Lewis
>
>
> If I remember/guess correctly, if you mark an OSD out it won't
> necessarily change the weight of the bucket above it (ie, the host),
> whereas if you change the weight of the OSD then the host bucket's
> weight changes.
> -Greg



That sounds right.  Marking an OSD out is a ceph osd reweight, not a ceph
osd crush reweight.

Experimentally confirmed.  I have an OSD out right now, and the host's
crush weight is the same as the other hosts' crush weight.
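
This is easy to see in "ceph osd tree" (a sketch; note that both commands below
trigger data movement, so don't run them casually on a production cluster):

  ceph osd tree                    # note the WEIGHT and REWEIGHT columns
  ceph osd out 3                   # only REWEIGHT of osd.3 drops to 0; host weight unchanged
  ceph osd crush reweight osd.3 0  # WEIGHT of osd.3 drops, and so does its host bucket
  ceph osd tree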
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RadosGW Direct Upload Limitation

2015-03-16 Thread Craig Lewis
>
>
> Maybe, but I'm not sure if Yehuda would want to take it upstream or
> not. This limit is present because it's part of the S3 spec. For
> larger objects you should use multi-part upload, which can get much
> bigger.
> -Greg
>
>
Note that the multi-part upload has a lower limit of 4MiB per part, and the
direct upload has an upper limit of 5GiB.

So you have to use both methods - direct upload for small files, and
multi-part upload for big files.

Your best bet is to use the Amazon S3 libraries.  They have functions that
take care of it for you.
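
For example, with s3cmd pointed at the RGW endpoint (a sketch; recent s3cmd
switches to multipart automatically above its chunk size):

  s3cmd put smallfile s3://mybucket/                              # single PUT
  s3cmd put --multipart-chunk-size-mb=256 bigfile s3://mybucket/  # multipart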


I'd like to see this mentioned in the Ceph documentation someplace.  When I
first encountered the issue, I couldn't find a limit in the RadosGW
documentation anywhere.  I only found the 5GiB limit in the Amazon API
documentation, which lead me to test on RadosGW.  Now that I know it was
done to preserve Amazon compatibility, I don't want to override the value
anymore.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] query about mapping of Swift/S3 APIs to Ceph cluster APIs

2015-03-16 Thread Craig Lewis
On Sat, Mar 14, 2015 at 3:04 AM, pragya jain  wrote:

> Hello all!
>
> I am working on Ceph object storage architecture from last few months.
>
> I am unable to search  a document which can describe how Ceph object
> storage APIs (Swift/S3 APIs) are mappedd with Ceph storage cluster APIs
> (librados APIs) to store the data at Ceph storage cluster.
>
> As the documents say: Radosgw, a gateway interface for ceph object storage
> users, accept user request to store or retrieve data in the form of Swift
> APIs or S3 APIs and convert the user's request in RADOS request.
>
> Please help me in knowing
> 1. how does Radosgw convert user request to RADOS request ?
> 2. how are HTTP requests mapped with RADOS request?
>
>
The RadosGW daemon takes care of that.  It's an application that sits on
top of RADOS.

For HTTP, there are a couple ways.  The older way has Apache accepting the
HTTP request, then forwarding that to the RadosGW daemon using FastCGI.
Newer versions support RadosGW handling the HTTP directly.

For the full details, you'll want to check out the source code at
https://github.com/ceph/ceph

If you're not interested enough to read the source code (I wasn't :-) ),
setup a test cluster.  Create a user, bucket, and object, and look at the
contents of the rados pools.
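
Something like this makes the mapping visible (a sketch; the pool names are the
Firefly/Giant defaults and may differ on your setup):

  rados lspools                     # .rgw, .rgw.buckets, .rgw.buckets.index, .users.uid, ...
  rados -p .rgw ls                  # one entry per bucket (bucket metadata)
  rados -p .rgw.buckets ls | head   # the S3/Swift object data, split into RADOS objects
  rados -p .users.uid ls            # user metadata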
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Shadow files

2015-03-16 Thread Craig Lewis
Out of curiosity, what's the frequency of the peaks and troughs?

RadosGW has configs for how long it should wait after a delete before
garbage collecting, how long between GC runs, and how many objects it can
GC per run.

The defaults are 2 hours, 1 hour, and 32 respectively.  Search
http://docs.ceph.com/docs/master/radosgw/config-ref/ for "rgw gc".

If your peaks and troughs have a frequency less than 1 hour, then GC is
going to delay and alias the disk usage w.r.t. the object count.

If you have millions of objects, you probably need to tweak those values.
If RGW is only GCing 32 objects an hour, it's never going to catch up.
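
To see whether GC is actually keeping up, the GC queue can be inspected and
drained by hand (a sketch):

  # list pending garbage-collection entries (--include-all shows entries not yet due)
  radosgw-admin gc list --include-all | head
  # run a GC pass immediately instead of waiting for the next scheduled one
  radosgw-admin gc process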


Now that I think about it, I bet I'm having issues here too.  I delete more
than (32*24) objects per day...



On Sun, Mar 15, 2015 at 4:41 PM, Ben  wrote:

> It is either a problem with CEPH, Civetweb or something else in our
> configuration.
> But deletes in user buckets is still leaving a high number of old shadow
> files. Since we have millions and millions of objects, it is hard to
> reconcile what should and shouldnt exist.
>
> Looking at our cluster usage, there are no troughs, it is just a rising
> peak.
> But when looking at users data usage, we can see peaks and troughs as you
> would expect as data is deleted and added.
>
> Our ceph version 0.80.9
>
> Please ideas?
>
> On 2015-03-13 02:25, Yehuda Sadeh-Weinraub wrote:
>
>> - Original Message -
>>
>>> From: "Ben" 
>>> To: ceph-us...@ceph.com
>>> Sent: Wednesday, March 11, 2015 8:46:25 PM
>>> Subject: Re: [ceph-users] Shadow files
>>>
>>> Anyone got any info on this?
>>>
>>> Is it safe to delete shadow files?
>>>
>>
>> It depends. Shadow files are badly named objects that represent part
>> of the objects data. They are only safe to remove if you know that the
>> corresponding objects no longer exist.
>>
>> Yehuda
>>
>>
>>> On 2015-03-11 10:03, Ben wrote:
>>> > We have a large number of shadow files in our cluster that aren't
>>> > being deleted automatically as data is deleted.
>>> >
>>> > Is it safe to delete these files?
>>> > Is there something we need to be aware of when deleting them?
>>> > Is there a script that we can run that will delete these safely?
>>> >
>>> > Is there something wrong with our cluster that it isn't deleting these
>>> > files when it should be?
>>> >
>>> > We are using civetweb with radosgw, with tengine ssl proxy infront of
>>> > it
>>> >
>>> > Any advice please
>>> > Thanks
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>  ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Shadow files

2015-03-16 Thread Gregory Farnum
On Mon, Mar 16, 2015 at 12:12 PM, Craig Lewis  wrote:
> Out of curiousity, what's the frequency of the peaks and troughs?
>
> RadosGW has configs on how long it should wait after deleting before garbage
> collecting, how long between GC runs, and how many objects it can GC in per
> run.
>
> The defaults are 2 hours, 1 hour, and 32 respectively.  Search
> http://docs.ceph.com/docs/master/radosgw/config-ref/ for "rgw gc".
>
> If your peaks and troughs have a frequency less than 1 hour, then GC is
> going to delay and alias the disk usage w.r.t. the object count.
>
> If you have millions of objects, you probably need to tweak those values.
> If RGW is only GCing 32 objects an hour, it's never going to catch up.
>
>
> Now that I think about it, I bet I'm having issues here too.  I delete more
> than (32*24) objects per day...

Uh, that's not quite what rgw_gc_max_objs means. That param configures
how the garbage collection data objects and internal classes are sharded,
and each grouping will only delete one object at a time. So it
controls the parallelism, but not the total number of objects!

Also, Yehuda says that changing this can be a bit dangerous because it
currently needs to be consistent across any program doing or
generating GC work.
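
You can also look at the GC queue directly instead of inferring it from disk
usage (assuming a recent enough radosgw-admin):

radosgw-admin gc list --include-all    # pending entries, including not-yet-eligible ones
radosgw-admin gc process               # kick off a GC pass by hand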
-Greg

>
>
>
> On Sun, Mar 15, 2015 at 4:41 PM, Ben  wrote:
>>
>> It is either a problem with CEPH, Civetweb or something else in our
>> configuration.
>> But deletes in user buckets is still leaving a high number of old shadow
>> files. Since we have millions and millions of objects, it is hard to
>> reconcile what should and shouldnt exist.
>>
>> Looking at our cluster usage, there are no troughs, it is just a rising
>> peak.
>> But when looking at users data usage, we can see peaks and troughs as you
>> would expect as data is deleted and added.
>>
>> Our ceph version 0.80.9
>>
>> Please ideas?
>>
>> On 2015-03-13 02:25, Yehuda Sadeh-Weinraub wrote:
>>>
>>> - Original Message -

 From: "Ben" 
 To: ceph-us...@ceph.com
 Sent: Wednesday, March 11, 2015 8:46:25 PM
 Subject: Re: [ceph-users] Shadow files

 Anyone got any info on this?

 Is it safe to delete shadow files?
>>>
>>>
>>> It depends. Shadow files are badly named objects that represent part
>>> of the objects data. They are only safe to remove if you know that the
>>> corresponding objects no longer exist.
>>>
>>> Yehuda
>>>

 On 2015-03-11 10:03, Ben wrote:
 > We have a large number of shadow files in our cluster that aren't
 > being deleted automatically as data is deleted.
 >
 > Is it safe to delete these files?
 > Is there something we need to be aware of when deleting them?
 > Is there a script that we can run that will delete these safely?
 >
 > Is there something wrong with our cluster that it isn't deleting these
 > files when it should be?
 >
 > We are using civetweb with radosgw, with tengine ssl proxy infront of
 > it
 >
 > Any advice please
 > Thanks
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Firefly, cephfs issues: different unix rights depending on the client and ls are slow

2015-03-16 Thread Gregory Farnum
On Sun, Mar 15, 2015 at 7:06 PM, Yan, Zheng  wrote:
> On Sat, Mar 14, 2015 at 7:03 AM, Scottix  wrote:
>> ...
>>
>>
>>> The time variation is caused cache coherence. when client has valid
>>> information
>>> in its cache, 'stat' operation will be fast. Otherwise the client need to
>>> send
>>> request to MDS and wait for reply, which will be slow.
>>
>>
>> This sounds like the behavior I had with CephFS giving me question marks.
>> When I had a directory with a large amount of files in it and the first ls
>> -la took a while to populate and ended with some unknown stats. The second
>> time I did an ls -la it ran quick with no question marks. My inquiry was if
>> there is a timeout that could occur? since it has to go ask the mds on a
>> different machine it seems plausible that the full response is not coming
>> back in time or fails to get all stats at some point.
>
> unknown stats shouldn't happen. which kernel are you using? can you reproduce
> this issue with ceph-fuse?

See https://www.mail-archive.com/ceph-users@lists.ceph.com/msg17445.html
for more context. It's an old kernel and happens on ceph-fuse.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?

2015-03-16 Thread Gregory Farnum
On Sat, Mar 14, 2015 at 1:56 AM, Chris Murray  wrote:
> Good evening all,
>
> Just had another quick look at this with some further logging on and thought 
> I'd post the results in case anyone can keep me moving in the right direction.
>
> Long story short, some OSDs just don't appear to come up after one failing 
> after another. Dealing with one in isolation, after a load of IO, it never 
> starts. The last in the log looks like this:
>
> ...
> 2015-03-13 18:43:11.875392 7f29d1e98780 10 filestore  > header.spos 0.0.0
> 2015-03-13 18:43:11.876568 7f29d1e98780 15 
> filestore(/var/lib/ceph/osd/ceph-1) _omap_rmkeyrange 
> meta/39e3fb/pglog_4.57c/0//-1 
> [00.,4294967295.18446744073709551615]
> 2015-03-13 18:43:11.876598 7f29d1e98780 15 
> filestore(/var/lib/ceph/osd/ceph-1) get_omap_iterator 
> meta/39e3fb/pglog_4.57c/0//-1
> 2015-03-13 18:43:11.952511 7f29d1e98780 15 
> filestore(/var/lib/ceph/osd/ceph-1) _omap_rmkeys meta/39e3fb/pglog_4.57c/0//-1
> 2015-03-13 18:43:11.952878 7f29d1e98780 10 filestore oid: 
> 39e3fb/pglog_4.57c/0//-1 not skipping op, *spos 13288339.0.3
> 2015-03-13 18:43:11.952892 7f29d1e98780 10 filestore  > header.spos 0.0.0
> 2015-03-13 18:43:11.961127 7f29d1e98780 15 
> filestore(/var/lib/ceph/osd/ceph-1) _omap_rmkeys meta/39e3fb/pglog_4.57c/0//-1
> 2015-03-13 18:43:11.961516 7f29d1e98780 10 filestore oid: 
> 39e3fb/pglog_4.57c/0//-1 not skipping op, *spos 13288339.0.4
> 2015-03-13 18:43:11.961529 7f29d1e98780 10 filestore  > header.spos 0.0.0
> 2015-03-13 18:43:11.965687 7f29d1e98780 15 
> filestore(/var/lib/ceph/osd/ceph-1) _omap_setkeys 
> meta/39e3fb/pglog_4.57c/0//-1
> 2015-03-13 18:43:11.966082 7f29d1e98780 10 filestore oid: 
> 39e3fb/pglog_4.57c/0//-1 not skipping op, *spos 13288339.0.5
> 2015-03-13 18:43:11.966095 7f29d1e98780 10 filestore  > header.spos 0.0.0
> 2015-03-13 18:43:11.989820 7f29d1e98780 10 journal op_apply_finish 13288339 
> open_ops 1 -> 0, max_applied_seq 13288338 -> 13288339
> 2015-03-13 18:43:11.989861 7f29d1e98780  3 journal journal_replay: r = 0, 
> op_seq now 13288339
> 2015-03-13 18:43:11.989896 7f29d1e98780  2 journal read_entry 3951706112 : 
> seq 13288340 1755 bytes
> 2015-03-13 18:43:11.989900 7f29d1e98780  3 journal journal_replay: applying 
> op seq 13288340
> 2015-03-13 18:43:11.989903 7f29d1e98780 10 journal op_apply_start 13288340 
> open_ops 0 -> 1
> 2015-03-13 18:43:11.989906 7f29d1e98780 10 
> filestore(/var/lib/ceph/osd/ceph-1) _do_transaction on 0x2750480
> 2015-03-13 18:43:11.989919 7f29d1e98780 15 
> filestore(/var/lib/ceph/osd/ceph-1) _omap_setkeys meta/16ef7597/infos/head//-1
> 2015-03-13 18:43:11.990251 7f29d1e98780 10 filestore oid: 
> 16ef7597/infos/head//-1 not skipping op, *spos 13288340.0.1
> 2015-03-13 18:43:11.990263 7f29d1e98780 10 filestore  > header.spos 0.0.0
> 2015-03-13 18:43:15.404558 7f29c4439700 20 
> filestore(/var/lib/ceph/osd/ceph-1) sync_entry woke after 5.000217
> 2015-03-13 18:43:15.404600 7f29c4439700 10 journal commit_start 
> max_applied_seq 13288339, open_ops 1
> 2015-03-13 18:43:15.404603 7f29c4439700 10 journal commit_start waiting for 1 
> open ops to drain
>
> What might this 'open op' mean when it never seems to finish 'draining'? 
> Could my suspicions be true that it's somehow a BTRFS funny?

Well, this line:

> 2015-03-13 18:43:11.989820 7f29d1e98780 10 journal op_apply_finish 13288339 
> open_ops 1 -> 0, max_applied_seq 13288338 -> 13288339

is the system recognizing that an op has finished. Is that last one
not followed by a similar line?

In any case, yes, btrfs misbehaving stupendously badly could cause
things to hang, if e.g. it's just not finishing a write that the OSD is
trying to put in. Although I'd naively expect some dmesg output if
that were the case.
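
A quick first check on the affected host would be something along these lines
(the paths are assumptions based on the log above):

dmesg | tail -n 50
btrfs filesystem show
btrfs device stats /var/lib/ceph/osd/ceph-1    # if your btrfs-progs has it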
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RadosGW Direct Upload Limitation

2015-03-16 Thread Yehuda Sadeh-Weinraub


- Original Message -
> From: "Craig Lewis" 
> To: "Gregory Farnum" 
> Cc: ceph-users@lists.ceph.com
> Sent: Monday, March 16, 2015 11:48:15 AM
> Subject: Re: [ceph-users] RadosGW Direct Upload Limitation
> 
> 
> 
> 
> Maybe, but I'm not sure if Yehuda would want to take it upstream or
> not. This limit is present because it's part of the S3 spec. For
> larger objects you should use multi-part upload, which can get much
> bigger.
> -Greg
> 
> 
> Note that the multi-part upload has a lower limit of 4MiB per part, and the
> direct upload has an upper limit of 5GiB.

The lower limit is 10MB per part, but it does not apply to the last part, so 
basically you could upload an object of any size with it. I would still recommend 
using the plain upload for smaller objects; it is faster, and the resulting object 
might be more efficient (for really small sizes).
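
As a rough illustration with s3cmd (an assumption; any S3 library with
multipart support will do, and the chunk size here is arbitrary):

s3cmd put small.bin s3://mybucket/small.bin                            # plain upload
s3cmd put --multipart-chunk-size-mb=15 big.iso s3://mybucket/big.iso   # split into parts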

Yehuda

> 
> So you have to use both methods - direct upload for small files, and
> multi-part upload for big files.
> 
> Your best bet is to use the Amazon S3 libraries. They have functions that
> take care of it for you.
> 
> 
> I'd like to see this mentioned in the Ceph documentation someplace. When I
> first encountered the issue, I couldn't find a limit in the RadosGW
> documentation anywhere. I only found the 5GiB limit in the Amazon API
> documentation, which lead me to test on RadosGW. Now that I know it was done
> to preserve Amazon compatibility, I don't want to override the value
> anymore.
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS unexplained writes

2015-03-16 Thread Erik Logtenberg
Hi,

I am getting relatively bad performance from cephfs. I use a replicated
cache pool on ssd in front of an erasure coded pool on rotating media.

When reading big files (streaming video), I see a lot of disk i/o,
especially writes. I have no clue what could cause these writes. The
writes are going to the hdd's and they stop when I stop reading.

I mounted everything with noatime and nodiratime so it shouldn't be
that. On a related note, the Cephfs metadata is stored on ssd too, so
metadata-related changes shouldn't hit the hdd's anyway I think.

Any thoughts? How can I get more information about what ceph is doing?
Using iotop I only see that the osd processes are busy but it doesn't
give many hints as to what they are doing.

Thanks,

Erik.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS unexplained writes

2015-03-16 Thread Erik Logtenberg
Hi,

I forgot to mention: while I am seeing these writes in iotop and
/proc/diskstats for the hdd's, I am -not- seeing any writes in "rados
df" for the pool residing on these disks. There is only one pool active
on the hdd's and according to rados df it is getting zero writes when
I'm just reading big files from cephfs.

So apparently the osd's are doing some non-trivial amount of writing on
their own behalf. What could it be?

Thanks,

Erik.


On 03/16/2015 10:26 PM, Erik Logtenberg wrote:
> Hi,
> 
> I am getting relatively bad performance from cephfs. I use a replicated
> cache pool on ssd in front of an erasure coded pool on rotating media.
> 
> When reading big files (streaming video), I see a lot of disk i/o,
> especially writes. I have no clue what could cause these writes. The
> writes are going to the hdd's and they stop when I stop reading.
> 
> I mounted everything with noatime and nodiratime so it shouldn't be
> that. On a related note, the Cephfs metadata is stored on ssd too, so
> metadata-related changes shouldn't hit the hdd's anyway I think.
> 
> Any thoughts? How can I get more information about what ceph is doing?
> Using iotop I only see that the osd processes are busy but it doesn't
> give many hints as to what they are doing.
> 
> Thanks,
> 
> Erik.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS unexplained writes

2015-03-16 Thread Gregory Farnum
The information you're giving sounds a little contradictory, but my
guess is that you're seeing the impact of object promotion and
flushing. You can sample the operations the OSDs are doing at any
given time by running the ops_in_progress (or similar, I forget the exact
phrasing) command on the OSD admin socket. I'm not sure whether "rados df"
is going to report cache movement activity or not.
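
Concretely, something like this on an OSD host (osd.3 and the socket path are
placeholders; on recent releases the command is dump_ops_in_flight):

ceph daemon osd.3 dump_ops_in_flight
ceph daemon osd.3 dump_historic_ops     # recently completed ops with per-stage timings
# or, pointing at the admin socket explicitly:
ceph --admin-daemon /var/run/ceph/ceph-osd.3.asok dump_ops_in_flight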

That though would mostly be written to the SSDs, not the hard drives —
although the hard drives could still get metadata updates written when
objects are flushed. What data exactly are you seeing that's leading
you to believe writes are happening against these drives? What is the
exact CephFS and cache pool configuration?
-Greg

On Mon, Mar 16, 2015 at 2:36 PM, Erik Logtenberg  wrote:
> Hi,
>
> I forgot to mention: while I am seeing these writes in iotop and
> /proc/diskstats for the hdd's, I am -not- seeing any writes in "rados
> df" for the pool residing on these disks. There is only one pool active
> on the hdd's and according to rados df it is getting zero writes when
> I'm just reading big files from cephfs.
>
> So apparently the osd's are doing some non-trivial amount of writing on
> their own behalf. What could it be?
>
> Thanks,
>
> Erik.
>
>
> On 03/16/2015 10:26 PM, Erik Logtenberg wrote:
>> Hi,
>>
>> I am getting relatively bad performance from cephfs. I use a replicated
>> cache pool on ssd in front of an erasure coded pool on rotating media.
>>
>> When reading big files (streaming video), I see a lot of disk i/o,
>> especially writes. I have no clue what could cause these writes. The
>> writes are going to the hdd's and they stop when I stop reading.
>>
>> I mounted everything with noatime and nodiratime so it shouldn't be
>> that. On a related note, the Cephfs metadata is stored on ssd too, so
>> metadata-related changes shouldn't hit the hdd's anyway I think.
>>
>> Any thoughts? How can I get more information about what ceph is doing?
>> Using iotop I only see that the osd processes are busy but it doesn't
>> give many hints as to what they are doing.
>>
>> Thanks,
>>
>> Erik.
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Shadow files

2015-03-16 Thread Ben

That's the thing: the peaks and troughs are in users' buckets only.
The actual cluster usage does not go up and down, it just goes up and up.

I would expect the overall cluster disk usage to show peaks and troughs 
much the same as the user buckets do.

But this is not the case.

We upgraded the cluster and radosgws to Giant (0.87.1) yesterday, and 
now we are seeing a large number of misplaced(??) objects being moved 
around.
Does this mean it has found all the shadow files that shouldn't exist 
anymore and is deleting them? If so I would expect to start seeing 
overall cluster usage drop, but this hasn't happened yet.
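
For reference, a rough way to gauge the leftovers (a sketch; .rgw.buckets is 
the default RGW data pool name, adjust if yours differs):

rados -p .rgw.buckets ls > all-objects.txt
grep -c shadow all-objects.txt                # shadow objects currently in the pool
radosgw-admin gc list --include-all | less    # what RGW still has queued for GC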


Any ideas?

On 2015-03-17 06:12, Craig Lewis wrote:

Out of curiousity, what's the frequency of the peaks and troughs?

RadosGW has configs on how long it should wait after deleting before
garbage collecting, how long between GC runs, and how many objects it
can GC in per run.

The defaults are 2 hours, 1 hour, and 32 respectively.  Search
http://docs.ceph.com/docs/master/radosgw/config-ref/ [2] for "rgw gc".

If your peaks and troughs have a frequency less than 1 hour, then GC
is going to delay and alias the disk usage w.r.t. the object count.

If you have millions of objects, you probably need to tweak those
values.  If RGW is only GCing 32 objects an hour, it's never going to
catch up.

Now that I think about it, I bet I'm having issues here too.  I delete
more than (32*24) objects per day...

On Sun, Mar 15, 2015 at 4:41 PM, Ben  wrote:


It is either a problem with CEPH, Civetweb or something else in our
configuration.
But deletes in user buckets is still leaving a high number of old
shadow files. Since we have millions and millions of objects, it is
hard to reconcile what should and shouldnt exist.

Looking at our cluster usage, there are no troughs, it is just a
rising peak.
But when looking at users data usage, we can see peaks and troughs
as you would expect as data is deleted and added.

Our ceph version 0.80.9

Please ideas?

On 2015-03-13 02:25, Yehuda Sadeh-Weinraub wrote:

- Original Message -
From: "Ben" 
To: ceph-us...@ceph.com
Sent: Wednesday, March 11, 2015 8:46:25 PM
Subject: Re: [ceph-users] Shadow files

Anyone got any info on this?

Is it safe to delete shadow files?

It depends. Shadow files are badly named objects that represent
part
of the objects data. They are only safe to remove if you know that
the
corresponding objects no longer exist.

Yehuda

On 2015-03-11 10:03, Ben wrote:

We have a large number of shadow files in our cluster that aren't
being deleted automatically as data is deleted.

Is it safe to delete these files?
Is there something we need to be aware of when deleting them?
Is there a script that we can run that will delete these safely?

Is there something wrong with our cluster that it isn't deleting

these

files when it should be?

We are using civetweb with radosgw, with tengine ssl proxy

infront of

it

Any advice please
Thanks

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [1]

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [1]



Links:
--
[1] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[2] http://docs.ceph.com/docs/master/radosgw/config-ref/

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

2015-03-16 Thread Nick Fisk




> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Gregory Farnum
> Sent: 16 March 2015 17:33
> To: Nick Fisk
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Cache Tier Flush = immediate base tier journal
> sync?
> 
> On Wed, Mar 11, 2015 at 2:25 PM, Nick Fisk  wrote:
> >
> > I’m not sure if it’s something I’m doing wrong or just experiencing an
> oddity, but when my cache tier flushes dirty blocks out to the base tier, the
> writes seem to hit the OSD’s straight away instead of coalescing in the
> journals, is this correct?
> >
> > For example if I create a RBD on a standard 3 way replica pool and run fio
> via librbd 128k writes, I see the journals take all the io’s until I hit my
> filestore_min_sync_interval and then I see it start writing to the underlying
> disks.
> >
> > Doing the same on a full cache tier (to force flushing)  I immediately see 
> > the
> base disks at a very high utilisation. The journals also have some write IO at
> the same time. The only other odd thing I can see via iostat is that most of
> the time whilst I’m running Fio, is that I can see the underlying disks doing
> very small write IO’s of around 16kb with an occasional big burst of activity.
> >
> > I know erasure coding+cache tier is slower than just plain replicated pools,
> but even with various high queue depths I’m struggling to get much above
> 100-150 iops compared to a 3 way replica pool which can easily achieve 1000-
> 1500. The base tier is comprised of 40 disks. It seems quite a marked
> difference and I’m wondering if this strange journal behaviour is the cause.
> >
> > Does anyone have any ideas?
> 
> If you're running a full cache pool, then on every operation touching an
> object which isn't in the cache pool it will try and evict an object. That's
> probably what you're seeing.
> 
> Cache pool in general are only a wise idea if you have a very skewed
> distribution of data "hotness" and the entire hot zone can fit in cache at
> once.
> -Greg

Hi Greg,

It's not the caching behaviour that I'm confused about, it's the journal 
behaviour on the base disks during flushing. I've been doing some more tests 
and can reproduce something which seems strange to me. 

First off 10MB of 4kb writes:
time ceph tell osd.1 bench 10000000 4096
{ "bytes_written": 10000000,
  "blocksize": 4096,
  "bytes_per_sec": "16009426.00"}

real    0m0.760s
user    0m0.063s
sys     0m0.022s

Now split this into 2x5mb writes:
time ceph tell osd.1 bench 5000000 4096 && time ceph tell osd.1 bench 5000000 4096
{ "bytes_written": 5000000,
  "blocksize": 4096,
  "bytes_per_sec": "10580846.00"}

real    0m0.595s
user    0m0.065s
sys     0m0.018s
{ "bytes_written": 5000000,
  "blocksize": 4096,
  "bytes_per_sec": "9944252.00"}

real    0m4.412s
user    0m0.053s
sys     0m0.071s

The 2nd bench takes a lot longer even though both should easily fit in the 5GB 
journal. Looking at iostat, I think I can see that no writes happen to the 
journal whilst the writes from the 1st bench are being flushed. Is this the 
expected behaviour? I would have thought that as long as there is space available 
in the journal it shouldn't block on new writes. Also I see in iostat writes to 
the underlying disk happening at a QD of 1 and 16kb IO's for a number of 
seconds, with a large blip of activity just before the flush finishes. Is this 
the correct behaviour? I would have thought that if this "tell osd bench" is doing 
sequential IO then the journal should be able to flush 5-10mb of data in a 
fraction of a second.

Ceph.conf
[osd]
filestore max sync interval = 30
filestore min sync interval = 20
filestore flusher = false
osd_journal_size = 5120
osd_crush_location_hook = /usr/local/bin/crush-location
osd_op_threads = 5
filestore_op_threads = 4


iostat during period where writes seem to be blocked (journal=sda disk=sdd)

Device:   rrqm/s  wrqm/s  r/s   w/s    rkB/s  wkB/s   avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sda       0.00    0.00    0.00  0.00   0.00   0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00
sdb       0.00    0.00    0.00  2.00   0.00   4.00    4.00      0.00      0.00   0.00     0.00     0.00   0.00
sdc       0.00    0.00    0.00  0.00   0.00   0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00
sdd       0.00    0.00    0.00  76.00  0.00   760.00  20.00     0.99      13.11  0.00     13.11    13.05  99.20

iostat during what I believe to be the actual flush

Device:   rrqm/s  wrqm/s  r/s   w/s    rkB/s  wkB/s   avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sda       0.00    0.00    0.00  0.00   0.00   0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00
sdb       0.00    0.00    0.00  2.00   0.00   4.00    4.00      0.00      0.00   0.00     0.00     0.00   0.00
sdc       0.00    0.00    0.00  0.00   0.

Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

2015-03-16 Thread Gregory Farnum
Nothing here particularly surprises me. I don't remember all the
details of the filestore's rate limiting off the top of my head, but
it goes to great lengths to try and avoid letting the journal get too
far ahead of the backing store. Disabling the filestore flusher and
increasing the sync intervals without also increasing the
filestore_wbthrottle_* limits is not going to work well for you.
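
Roughly speaking, if you raise the sync intervals you need to raise the
wbthrottle knobs with them; something like this (the values are purely
illustrative, not a recommendation):

[osd]
filestore wbthrottle xfs ios start flusher = 5000
filestore wbthrottle xfs ios hard limit = 50000
filestore wbthrottle xfs inodes start flusher = 5000
filestore wbthrottle xfs inodes hard limit = 50000
filestore wbthrottle xfs bytes start flusher = 419430400
filestore wbthrottle xfs bytes hard limit = 4194304000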
-Greg

On Mon, Mar 16, 2015 at 3:58 PM, Nick Fisk  wrote:
>
>
>
>
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Gregory Farnum
>> Sent: 16 March 2015 17:33
>> To: Nick Fisk
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Cache Tier Flush = immediate base tier journal
>> sync?
>>
>> On Wed, Mar 11, 2015 at 2:25 PM, Nick Fisk  wrote:
>> >
>> > I’m not sure if it’s something I’m doing wrong or just experiencing an
>> oddity, but when my cache tier flushes dirty blocks out to the base tier, the
>> writes seem to hit the OSD’s straight away instead of coalescing in the
>> journals, is this correct?
>> >
>> > For example if I create a RBD on a standard 3 way replica pool and run fio
>> via librbd 128k writes, I see the journals take all the io’s until I hit my
>> filestore_min_sync_interval and then I see it start writing to the underlying
>> disks.
>> >
>> > Doing the same on a full cache tier (to force flushing)  I immediately see 
>> > the
>> base disks at a very high utilisation. The journals also have some write IO 
>> at
>> the same time. The only other odd thing I can see via iostat is that most of
>> the time whilst I’m running Fio, is that I can see the underlying disks doing
>> very small write IO’s of around 16kb with an occasional big burst of 
>> activity.
>> >
>> > I know erasure coding+cache tier is slower than just plain replicated 
>> > pools,
>> but even with various high queue depths I’m struggling to get much above
>> 100-150 iops compared to a 3 way replica pool which can easily achieve 1000-
>> 1500. The base tier is comprised of 40 disks. It seems quite a marked
>> difference and I’m wondering if this strange journal behaviour is the cause.
>> >
>> > Does anyone have any ideas?
>>
>> If you're running a full cache pool, then on every operation touching an
>> object which isn't in the cache pool it will try and evict an object. That's
>> probably what you're seeing.
>>
>> Cache pool in general are only a wise idea if you have a very skewed
>> distribution of data "hotness" and the entire hot zone can fit in cache at
>> once.
>> -Greg
>
> Hi Greg,
>
> It's not the caching behaviour that I confused about, it’s the journal 
> behaviour on the base disks during flushing. I've been doing some more tests 
> and can do something reproducible which seems strange to me.
>
> First off 10MB of 4kb writes:
> time ceph tell osd.1 bench 1000 4096
> { "bytes_written": 1000,
>   "blocksize": 4096,
>   "bytes_per_sec": "16009426.00"}
>
> real0m0.760s
> user0m0.063s
> sys 0m0.022s
>
> Now split this into 2x5mb writes:
> time ceph tell osd.1 bench 500 4096 &&  time ceph tell osd.1 bench 
> 500 4096
> { "bytes_written": 500,
>   "blocksize": 4096,
>   "bytes_per_sec": "10580846.00"}
>
> real0m0.595s
> user0m0.065s
> sys 0m0.018s
> { "bytes_written": 500,
>   "blocksize": 4096,
>   "bytes_per_sec": "9944252.00"}
>
> real0m4.412s
> user0m0.053s
> sys 0m0.071s
>
> 2nd bench takes a lot longer even though both should easily fit in the 5GB 
> journal. Looking at iostat, I think I can see that no writes happen to the 
> journal whilst the writes from the 1st bench are being flushed. Is this the 
> expected behaviour? I would have thought as long as there is space available 
> in the journal it shouldn't block on new writes. Also I see in iostat writes 
> to the underlying disk happening at a QD of 1 and 16kb IO's for a number of 
> seconds, with a large blip or activity just before the flush finishes. Is 
> this the correct behaviour? I would have thought if this "tell osd bench" is 
> doing sequential IO then the journal should be able to flush 5-10mb of data 
> in a fraction a second.
>
> Ceph.conf
> [osd]
> filestore max sync interval = 30
> filestore min sync interval = 20
> filestore flusher = false
> osd_journal_size = 5120
> osd_crush_location_hook = /usr/local/bin/crush-location
> osd_op_threads = 5
> filestore_op_threads = 4
>
>
> iostat during period where writes seem to be blocked (journal=sda disk=sdd)
>
> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
> avgqu-sz   await r_await w_await  svctm  %util
> sda   0.00 0.000.000.00 0.00 0.00 0.00
>  0.000.000.000.00   0.00   0.00
> sdb   0.00 0.000.002.00 0.00 4.00 4.00
>  0.000.000.000.00   0.00   0.00
> sdc   0.00 0.000.000.00 0.00 0.00 0.00
>  0.000.

Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

2015-03-16 Thread Christian Balzer
On Mon, 16 Mar 2015 16:09:12 -0700 Gregory Farnum wrote:

> Nothing here particularly surprises me. I don't remember all the
> details of the filestore's rate limiting off the top of my head, but
> it goes to great lengths to try and avoid letting the journal get too
> far ahead of the backing store. Disabling the filestore flusher and
> increasing the sync intervals without also increasing the
> filestore_wbthrottle_* limits is not going to work well for you.
> -Greg
> 
While very true and what I recalled (backing store being kicked off early)
from earlier mails, I think having every last configuration parameter
documented in a way that doesn't reduce people to guesswork would be very
helpful.

For example "filestore_wbthrottle_xfs_inodes_start_flusher" which defaults
to 500. 
Assuming that this means to start flushing once 500 inodes have
accumulated, how would Ceph even know how many inodes are needed for the
data present?

Lastly, these parameters come in xfs and btrfs incarnations, but there is no
ext4 one.
Do the xfs parameters also apply to ext4?

Christian

> On Mon, Mar 16, 2015 at 3:58 PM, Nick Fisk  wrote:
> >
> >
> >
> >
> >> -Original Message-
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> >> Of Gregory Farnum
> >> Sent: 16 March 2015 17:33
> >> To: Nick Fisk
> >> Cc: ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] Cache Tier Flush = immediate base tier
> >> journal sync?
> >>
> >> On Wed, Mar 11, 2015 at 2:25 PM, Nick Fisk  wrote:
> >> >
> >> > I’m not sure if it’s something I’m doing wrong or just experiencing
> >> > an
> >> oddity, but when my cache tier flushes dirty blocks out to the base
> >> tier, the writes seem to hit the OSD’s straight away instead of
> >> coalescing in the journals, is this correct?
> >> >
> >> > For example if I create a RBD on a standard 3 way replica pool and
> >> > run fio
> >> via librbd 128k writes, I see the journals take all the io’s until I
> >> hit my filestore_min_sync_interval and then I see it start writing to
> >> the underlying disks.
> >> >
> >> > Doing the same on a full cache tier (to force flushing)  I
> >> > immediately see the
> >> base disks at a very high utilisation. The journals also have some
> >> write IO at the same time. The only other odd thing I can see via
> >> iostat is that most of the time whilst I’m running Fio, is that I can
> >> see the underlying disks doing very small write IO’s of around 16kb
> >> with an occasional big burst of activity.
> >> >
> >> > I know erasure coding+cache tier is slower than just plain
> >> > replicated pools,
> >> but even with various high queue depths I’m struggling to get much
> >> above 100-150 iops compared to a 3 way replica pool which can easily
> >> achieve 1000- 1500. The base tier is comprised of 40 disks. It seems
> >> quite a marked difference and I’m wondering if this strange journal
> >> behaviour is the cause.
> >> >
> >> > Does anyone have any ideas?
> >>
> >> If you're running a full cache pool, then on every operation touching
> >> an object which isn't in the cache pool it will try and evict an
> >> object. That's probably what you're seeing.
> >>
> >> Cache pool in general are only a wise idea if you have a very skewed
> >> distribution of data "hotness" and the entire hot zone can fit in
> >> cache at once.
> >> -Greg
> >
> > Hi Greg,
> >
> > It's not the caching behaviour that I confused about, it’s the journal
> > behaviour on the base disks during flushing. I've been doing some more
> > tests and can do something reproducible which seems strange to me.
> >
> > First off 10MB of 4kb writes:
> > time ceph tell osd.1 bench 1000 4096
> > { "bytes_written": 1000,
> >   "blocksize": 4096,
> >   "bytes_per_sec": "16009426.00"}
> >
> > real0m0.760s
> > user0m0.063s
> > sys 0m0.022s
> >
> > Now split this into 2x5mb writes:
> > time ceph tell osd.1 bench 500 4096 &&  time ceph tell osd.1 bench
> > 500 4096 { "bytes_written": 500,
> >   "blocksize": 4096,
> >   "bytes_per_sec": "10580846.00"}
> >
> > real0m0.595s
> > user0m0.065s
> > sys 0m0.018s
> > { "bytes_written": 500,
> >   "blocksize": 4096,
> >   "bytes_per_sec": "9944252.00"}
> >
> > real0m4.412s
> > user0m0.053s
> > sys 0m0.071s
> >
> > 2nd bench takes a lot longer even though both should easily fit in the
> > 5GB journal. Looking at iostat, I think I can see that no writes
> > happen to the journal whilst the writes from the 1st bench are being
> > flushed. Is this the expected behaviour? I would have thought as long
> > as there is space available in the journal it shouldn't block on new
> > writes. Also I see in iostat writes to the underlying disk happening
> > at a QD of 1 and 16kb IO's for a number of seconds, with a large blip
> > or activity just before the flush finishes. Is this the correct
> > behaviour? I would have thought if this "tell osd bench" is doing
> > sequential IO then the journal should be able t

Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

2015-03-16 Thread Gregory Farnum
On Mon, Mar 16, 2015 at 4:46 PM, Christian Balzer  wrote:
> On Mon, 16 Mar 2015 16:09:12 -0700 Gregory Farnum wrote:
>
>> Nothing here particularly surprises me. I don't remember all the
>> details of the filestore's rate limiting off the top of my head, but
>> it goes to great lengths to try and avoid letting the journal get too
>> far ahead of the backing store. Disabling the filestore flusher and
>> increasing the sync intervals without also increasing the
>> filestore_wbthrottle_* limits is not going to work well for you.
>> -Greg
>>
> While very true and what I recalled (backing store being kicked off early)
> from earlier mails, I think having every last configuration parameter
> documented in a way that doesn't reduce people to guesswork would be very
> helpful.

PRs welcome! ;)

More seriously, we create a lot of config options and it's not always
clear when doing so which ones should be changed by users or not. And
a lot of them (case in point: anything to do with changing journal and
FS interactions) should only be changed by people who really
understand them, because it's possible (as evidenced) to really bust
up your cluster's performance enough that it's basically broken.
Historically that's meant "people who can read the code and understand
it", although we might now have enough people at a mid-line that it's
worth going back and documenting. There's not a lot of pressure coming
from anybody to do that work in comparison to other stuff like "make
CephFS supported" and "make RADOS faster" though, for understandable
reasons. So while we can try and document these things some in future,
the names of things here are really pretty self-explanatory and the
sort of configuration reference guide  I think you're asking for (ie,
"here are all the settings to change if you are running on SSDs, and
here's how they're related") is not the kind of thing that developers
produce. That comes out of the community or is produced by support
contracts.

...so I guess I've circled back around to "PRs welcome!"

> For example "filestore_wbthrottle_xfs_inodes_start_flusher" which defaults
> to 500.
> Assuming that this means to start flushing once 500 inodes have
> accumulated, how would Ceph even know how many inodes are needed for the
> data present?

Number of dirtied objects, of course.

>
> Lastly with these parameters, there is xfs and btrfs incarnations, no
> ext4.
> Do the xfs parameters also apply to ext4?

Uh, looks like it does, but I'm just skimming source right now so you
should check if you change these params. :)
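
An easy way to see what a running OSD actually ends up with (the daemon name
is a placeholder):

ceph daemon osd.0 config show | grep filestore_wbthrottle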
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph.conf

2015-03-16 Thread Jesus Chavez (jeschave)
Hi all, I have seen that newer versions of Ceph on newer OSes like RHEL 7 and 
CentOS 7 no longer need per-daemon entries like mon.node1 and osd.0 in ceph.conf. 
Can anybody tell me if that is really the case, or do I still need to write config 
like this:

[osd.0]
  host = sagitario
  addr = 192.168.1.67
[mon.leo]
  host = leo
  mon addr = 192.168.1.81:6789
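
For comparison, the newer minimal style would look roughly like this, with no
per-daemon sections at all (a sketch; the fsid and auth settings are
placeholders):

[global]
fsid = <cluster uuid>
mon initial members = leo
mon host = 192.168.1.81
auth cluster required = cephx
auth service required = cephx
auth client required = cephx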


Jesus Chavez
SYSTEMS ENGINEER-C.SALES

jesch...@cisco.com
Phone: +52 55 5267 3146
Mobile: +51 1 5538883255

CCIE - 44433


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: delayed objects deletion ?

2015-03-16 Thread Yan, Zheng
On Mon, Mar 16, 2015 at 5:08 PM, Florent B  wrote:
> Since then I deleted the pool.
>
> But I now have another problem, in fact the "opposite" of the previous :
> now I never deleted files in clients, data objects and metadata are
> still in pools, but directory is empty for clients (it is another
> directory, other pool, etc. from previous problem).
>
> Here are logs from MDS when I restart it about one of the files :
>
> 2015-03-16 09:57:48.626254 7f4177694700 12 mds.0.cache.dir(1a95e05)
> link_primary_inode [dentry #1/staging/api/easyrsa/vars [2,head] auth
> NULL (dversion lock) v=22 inode=0 | dirty=1 0x6ca5a20] [inode
> 1a95e11 [2,head] #1a95e11 auth v22 s=0 n(v0 1=1+0) (iversion
> lock) cr={29050627=0-1966080@1} 0x53c32c8]
> 2015-03-16 09:57:48.626258 7f4177694700 10 mds.0.journal
> EMetaBlob.replay added [inode 1a95e11 [2,head]
> /staging/api/easyrsa/vars auth v22 s=0 n(v0 1=1+0) (iversion lock)
> cr={29050627=0-1966080@1} 0x53c32c8]
> 2015-03-16 09:57:48.626260 7f4177694700 10 mds.0.cache.ino(1a95e11)
> mark_dirty_parent
> 2015-03-16 09:57:48.626261 7f4177694700 10 mds.0.journal
> EMetaBlob.replay noting opened inode [inode 1a95e11 [2,head]
> /staging/api/easyrsa/vars auth v22 dirtyparent s=0 n(v0 1=1+0) (iversion
> lock) cr={29050627=0-1966080@1} | dirtyparent=1 dirty=1 0x53c32c8]
> 2015-03-16 09:57:48.626264 7f4177694700 10 mds.0.journal
> EMetaBlob.replay sessionmap v 21580500 -(1|2) == table 21580499 prealloc
> [] used 1a95e11
> 2015-03-16 09:57:48.626265 7f4177694700 20 mds.0.journal  (session
> prealloc [1a95e11~3dd])
> 2015-03-16 09:57:48.626843 7f4177694700 10 mds.0.journal
> EMetaBlob.replay for [2,head] had [inode 1a95e11 [2,head]
> /staging/api/easyrsa/vars auth v42 dirtyparent s=8089 n(v0 b8089 1=1+0)
> (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
> 2015-03-16 09:57:48.629319 7f4177694700 10 mds.0.journal
> EMetaBlob.replay for [2,head] had [inode 1a95e11 [2,head]
> /staging/api/easyrsa/vars auth v99 dirtyparent s=8089 n(v0 b8089 1=1+0)
> (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
> 2015-03-16 09:57:48.629357 7f4177694700 10 mds.0.journal
> EMetaBlob.replay for [2,head] had [inode 1a95e11 [2,head]
> /staging/api/easyrsa/vars auth v101 dirtyparent s=8089 n(v0 b8089 1=1+0)
> (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
> 2015-03-16 09:57:48.636559 7f4177694700 10 mds.0.journal
> EMetaBlob.replay for [2,head] had [inode 1a95e11 [2,head]
> /staging/api/easyrsa/vars auth v164 dirtyparent s=8089 n(v0 b8089 1=1+0)
> (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
> 2015-03-16 09:57:48.636597 7f4177694700 10 mds.0.journal
> EMetaBlob.replay for [2,head] had [inode 1a95e11 [2,head]
> /staging/api/easyrsa/vars auth v166 dirtyparent s=8089 n(v0 b8089 1=1+0)
> (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
> 2015-03-16 09:57:48.644280 7f4177694700 10 mds.0.journal
> EMetaBlob.replay for [2,head] had [inode 1a95e11 [2,head]
> /staging/api/easyrsa/vars auth v227 dirtyparent s=8089 n(v0 b8089 1=1+0)
> (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
> 2015-03-16 09:57:48.644318 7f4177694700 10 mds.0.journal
> EMetaBlob.replay for [2,head] had [inode 1a95e11 [2,head]
> /staging/api/easyrsa/vars auth v229 dirtyparent s=8089 n(v0 b8089 1=1+0)
> (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
> 2015-03-16 09:57:51.911267 7f417c9a1700 15 mds.0.cache  chose lock
> states on [inode 1a95e11 [2,head] /staging/api/easyrsa/vars auth
> v229 dirtyparent s=8089 n(v0 b8089 1=1+0) (iversion lock) |
> dirtyparent=1 dirty=1 0x53c32c8]
> 2015-03-16 09:57:51.916816 7f417c9a1700 20 mds.0.locker
> check_inode_max_size no-op on [inode 1a95e11 [2,head]
> /staging/api/easyrsa/vars auth v229 dirtyparent s=8089 n(v0 b8089 1=1+0)
> (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
> 2015-03-16 09:57:51.958925 7f417c9a1700  7 mds.0.cache inode [inode
> 1a95e11 [2,head] /staging/api/easyrsa/vars auth v229 dirtyparent
> s=8089 n(v0 b8089 1=1+0) (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
> 2015-03-16 09:57:56.561404 7f417c9a1700 10 mds.0.cache  unlisting
> unwanted/capless inode [inode 1a95e11 [2,head]
> /staging/api/easyrsa/vars auth v229 dirtyparent s=8089 n(v0 b8089 1=1+0)
> (iversion lock) | dirtyparent=1 dirty=1 0x53c32c8]
>
>

This log message is not for deleted files. Could you try again and upload the
log file and the output of "rados -p data ls" somewhere?
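
For example, something like this on the MDS host (paths assumed; "data" is the
default CephFS data pool name):

# in ceph.conf on the MDS node, then restart the MDS
[mds]
debug mds = 20

# capture the object listing and the MDS log
rados -p data ls > /tmp/cephfs-data-objects.txt
cp /var/log/ceph/ceph-mds.*.log /tmp/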

Regards
Yan, Zheng


> What is going on ?
>
> On 03/16/2015 02:18 AM, Yan, Zheng wrote:
>> I don't know what was wrong. could you use "rados -p data ls" to check
>> which objects still exist. Then restart the mds MDS with debug_mds=20
>> and search the log for name of the remaining objects.
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] SUBSCRIBE

2015-03-16 Thread 谢锐
SUBSCRIBE  ceph-users
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com