Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

2015-10-19 Thread Jan Schermer
I understand this. But the clients can't request something that doesn't fit a 
(POSIX) filesystem's capabilities. That means the requests can map 1:1 onto the 
filestore (O_FSYNC from the client == O_FSYNC on the filestore object...). 
Pagecache/IO schedulers are already smart enough to merge requests and preserve 
ordering - they just do the right thing already. It's true that in a 
distributed environment one async request can map to one OSD and then a 
synchronous one comes along and needs the first one to be flushed beforehand, so that 
logic is presumably in place already - but I still don't see much need for a 
journal in there (btw, in the case of RBD with caching, this logic is probably not 
even needed at all, and merging requests in the RBD cache makes more sense than 
merging somewhere down the line).
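
To make that 1:1 mapping concrete, here is a minimal sketch of what the
pass-through could look like on the filestore side (the path and payload are
made up for illustration; this is not Ceph code):

/* Buffered write to the backing object file, made durable only when the
 * client actually asked for durability. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/var/lib/ceph/osd/ceph-0/current/some-object", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char buf[] = "payload";
    /* Ordinary client write: lands in the page cache and is flushed later. */
    if (pwrite(fd, buf, sizeof buf, 0) < 0) { perror("pwrite"); return 1; }

    /* The client requested a flush (O_SYNC/fsync on its side): forward it
     * as fdatasync on the object file. */
    if (fdatasync(fd) < 0) { perror("fdatasync"); return 1; }

    close(fd);
    return 0;
}
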
It might be faster to merge small writes in the journal when the journal is on SSDs 
and the filestore is on spinning rust, but it will surely be slower (CPU bound by 
ceph-osd?) when the filestore is fast enough or when the merging is not optimal.
I have never touched anything but a pure SSD cluster, though - I have always 
been CPU bound and that's why I started thinking about this in the first place. 
I'd love to have my disks saturated with requests from clients one day.

Don't take this the wrong way, but I've been watching ceph perf talks and stuff 
and haven't seen anything that would make Ceph comparably fast to an ordinary 
SAN/NAS.
Maybe this is a completely wrong idea, I just think it might be worth thinking 
about.

Thanks

Jan


> On 14 Oct 2015, at 20:29, Somnath Roy  wrote:
> 
> A filesystem like XFS guarantees a single file write, but in a Ceph transaction we 
> are touching the file, xattrs and leveldb (omap), so there is no way the filesystem 
> can guarantee that transaction. That's why FileStore has implemented a write-ahead journal. 
> Basically, it writes the entire transaction object there and only trims it 
> from the journal when it has actually been applied (all the operations 
> executed) and persisted in the backend. 
> 
> Thanks & Regards
> Somnath
> 
> -Original Message-
> From: Jan Schermer [mailto:j...@schermer.cz] 
> Sent: Wednesday, October 14, 2015 9:06 AM
> To: Somnath Roy
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?
> 
> But that's exactly what filesystems and their own journals do already :-)
> 
> Jan
> 
>> On 14 Oct 2015, at 17:02, Somnath Roy  wrote:
>> 
>> Jan,
>> Journal helps FileStore to maintain the transactional integrity in the event 
>> of a crash. That's the main reason.
>> 
>> Thanks & Regards
>> Somnath
>> 
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan 
>> Schermer
>> Sent: Wednesday, October 14, 2015 2:28 AM
>> To: ceph-users@lists.ceph.com
>> Subject: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?
>> 
>> Hi,
>> I've been thinking about this for a while now - does Ceph really need a 
>> journal? Filesystems are already pretty good at committing data to disk when 
>> asked (and much faster too), we have external journals in XFS and Ext4...
>> In a scenario where client does an ordinary write, there's no need to flush 
>> it anywhere (the app didn't ask for it) so it ends up in pagecache and gets 
>> committed eventually.
>> If a client asks for the data to be flushed then fdatasync/fsync on the 
>> filestore object takes care of that, including ordering and stuff.
>> For reads, you just read from filestore (no need to differentiate between 
>> filestore/journal) - pagecache gives you the right version already.
>> 
>> Or is the journal there to achieve some tiering for writes when running 
>> spindles with SSDs? This is IMO the only thing ordinary filesystems don't do 
>> out of box even when filesystem journal is put on SSD - the data get flushed 
>> to spindle whenever fsync-ed (even with data=journal). But in reality, most 
>> of the data will hit the spindle either way and when you run with SSDs it 
>> will always be much slower. And even for tiering - there are already many 
>> options (bcache, flashcache or even ZFS L2ARC) that are much more performant 
>> and proven stable. I think the fact that people  have a need to combine Ceph 
>> with stuff like that already proves the point.
>> 
>> So a very interesting scenario would be to disable Ceph journal and at most 
>> use data=journal on ext4. The complexity of the data path would drop 
>> significantly, latencies decrease, CPU time is saved...
>> I just feel that Ceph has lots of unnecessary complexity inside that 
>> duplicates what filesystems (and pagecache...) have been doing for a while 
>> now without eating most of our CPU cores - why don't we use that? Is it 
>> possible to disable journal completely?
>> 
>> Did I miss something that makes journal essential?
>> 
>> Jan
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com

Re: [ceph-users] CephFS and page cache

2015-10-19 Thread Burkhard Linke

Hi,

On 10/19/2015 05:27 AM, Yan, Zheng wrote:

On Sat, Oct 17, 2015 at 1:42 AM, Burkhard Linke
 wrote:

Hi,

I've noticed that CephFS (both ceph-fuse and kernel client in version 4.2.3)
remove files from page cache as soon as they are not in use by a process
anymore.

Is this intended behaviour? We use CephFS as a replacement for NFS in our
HPC cluster. It should serve large files which are read by multiple jobs on
multiple hosts, so keeping them in the page cache over the duration of
several job invocations is crucial.

Yes. MDS needs resource to track the cached data. We don't want MDS
use too much resource.


Mount options are defaults,noatime,_netdev (+ extra options for the kernel
client). Is there an option to keep data in page cache just like any other
filesystem?

So far there is no option to do that. Later, we may add an option to
keep the cached data for a few seconds.


This renders CephFS useless for almost any HPC cluster application. And 
keeping data for a few seconds is not a solution in most cases.


CephFS supports capabilities to manage access to objects, enforce 
consistency of data etc. IMHO a sane way to handle the page cache is to use 
a capability to inform the MDS about cached objects; as long as no other 
client claims write access to an object or its metadata, the cached copy 
is considered consistent. Upon write access the client should drop the 
capability (and thus remove the object from the page cache). If another 
process tries to access a cached object with an intact 'cache' capability, 
it may be promoted to a read/write capability.


I haven't dug into the details of either capabilities or kernel page 
cache, but the method described above should be very similar to the 
existing read only capability. I don't know whether there's a kind of 
eviction callback in the page cache that cephfs can use to update 
capabilities if an object is removed from the page cache (e.g. due to 
memory pressure), but I'm pretty sure that other filesystems like NFS 
also need to keep track of what's cached.
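
For what it's worth, whether a file's pages are still resident on a client can
be checked with mincore(2); a minimal sketch (purely illustrative, run it
before and after the file is closed elsewhere):

/* Report how many pages of a file are currently resident in the page cache. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { fprintf(stderr, "empty file or fstat failed\n"); return 1; }

    /* Mapping the file does not fault its pages in; mincore() only reports
     * which pages are already present in the page cache. */
    void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    long psz = sysconf(_SC_PAGESIZE);
    size_t pages = (st.st_size + psz - 1) / psz;
    unsigned char *vec = malloc(pages);
    if (!vec || mincore(map, st.st_size, vec) < 0) { perror("mincore"); return 1; }

    size_t resident = 0;
    for (size_t i = 0; i < pages; i++)
        if (vec[i] & 1)
            resident++;
    printf("%zu of %zu pages resident\n", resident, pages);

    munmap(map, st.st_size);
    close(fd);
    return 0;
}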


This approach will probably increase the resource usage of both the MDS and 
cephfs clients, but the benefits are obvious. For use cases with limited 
resources the MDS may refuse the 'cache' capability to clients to reduce 
the memory footprint.


Just my 2 ct and regards,
Burkhard


Re: [ceph-users] CephFS and page cache

2015-10-19 Thread Shinobu Kinjo
What kind of applications are you talking about with regard to HPC?

Are you talking about something like NetCDF?

Caching is quite necessary for some computational applications,
but that's not always the case.

It's not quite related to this topic, but I'm really interested in your
thoughts on using a Ceph cluster for HPC computation.

Shinobu 

- Original Message -
From: "Burkhard Linke" 
To: ceph-users@lists.ceph.com
Sent: Monday, October 19, 2015 4:59:21 PM
Subject: Re: [ceph-users] CephFS and page cache

Hi,

On 10/19/2015 05:27 AM, Yan, Zheng wrote:
> On Sat, Oct 17, 2015 at 1:42 AM, Burkhard Linke
>  wrote:
>> Hi,
>>
>> I've noticed that CephFS (both ceph-fuse and kernel client in version 4.2.3)
>> remove files from page cache as soon as they are not in use by a process
>> anymore.
>>
>> Is this intended behaviour? We use CephFS as a replacement for NFS in our
>> HPC cluster. It should serve large files which are read by multiple jobs on
>> multiple hosts, so keeping them in the page cache over the duration of
>> several job invocations is crucial.
> Yes. MDS needs resource to track the cached data. We don't want MDS
> use too much resource.
>
>> Mount options are defaults,noatime,_netdev (+ extra options for the kernel
>> client). Is there an option to keep data in page cache just like any other
>> filesystem?
> So far there is no option to do that. Later, we may add an option to
> keep the cached data for a few seconds.

This renders CephFS useless for almost any HPC cluster application. And 
keeping data for a few seconds is not a solution in most cases.

CephFS supports capabilities to manages access to objects, enforce 
consistency of data etc. IMHO a sane way to handle the page cache is use 
a capability to inform the mds about caches objects; as long as no other 
client claims write access to an object or its metadata, the cache copy 
is considered consistent. Upon write access the client should drop the 
capability (and thus remove the object from the page cache). If another 
process tries to access a cache object with intact 'cache' capability, 
it may be promoted to read/write capability.

I haven't dug into the details of either capabilities or kernel page 
cache, but the method described above should be very similar to the 
existing read only capability. I don't know whether there's a kind of 
eviction callback in the page cache that cephfs can use to update 
capabilities if an object is removed from the page cache (e.g. due to 
memory pressure), but I'm pretty sure that other filesystems like NFS 
also need to keep track of what's cached.

This approach will probably increase the resources for both MDS and 
cephfs clients, but the benefits are obvious. For use cases with limited 
resource the MDS may refuse the 'cache' capability to client to reduce 
the memory footprint.

Just my 2 ct and regards,
Burkhard


Re: [ceph-users] CephFS and page cache

2015-10-19 Thread Burkhard Linke

Hi,

On 10/19/2015 10:34 AM, Shinobu Kinjo wrote:

What kind of applications are you talking about regarding to applications
for HPC.

Are you talking about like netcdf?

Caching is quite necessary for some applications for computation.
But it's not always the case.

It's not quite related to this topic but I'm really interested in your
thought using Ceph cluster for HPC computation.
Our applications are in the field of bioinformatics. This involves read 
mapping, homology search in databases etc.


In almost all cases there's a fixed dataset or database, like the human 
genome with all read mapping index files (> 20 GB) or the database with 
all known protein sequences (> 25 GB). With enough RAM in the cluster 
machines most of these datasets can be kept in memory for subsequent 
processing runs.


These datasets are updated from time to time, so keeping them on 
network storage is simpler than distributing updates to instances on 
local hard disks. The latter would also require intensive interaction with the 
queuing system to ensure that one job array operates on a consistent 
dataset. It worked fine with NFS-based storage, but NFS introduces a 
single point of failure (except for pNFS).


Regards,
Burkhard


[ceph-users] Cinder + CEPH Storage Full Scenario

2015-10-19 Thread Bharath Krishna
Hi

What happens when the CEPH storage cluster backing the Cinder service is 
FULL?

What would be the outcome of a new cinder create-volume request?

Will the volume be created even though no space is available for use, or will an 
error be thrown from the Cinder API stating that no space is available for the new volume?

I could not try this in my environment by filling up the cluster.

Please reply if you have ever tried and tested this.

Thank you.

Regards,
M Bharath Krishna


Re: [ceph-users] Cinder + CEPH Storage Full Scenario

2015-10-19 Thread Jan Schermer
Do you mean when the CEPH cluster (OSDs) is physically full or when the quota 
is reached?

If CEPH becomes full it just stalls all IO (maybe just write IO, but 
effectively the same thing) - it's not pretty and you must never ever let it become full.

Jan


> On 19 Oct 2015, at 11:15, Bharath Krishna  wrote:
> 
> Hi
> 
> What happens when Cinder service with CEPH backend storage cluster capacity 
> is FULL?
> 
> What would be the out come of new cinder create volume request?
> 
> Will volume be created with space not available for use or an error thrown 
> from Cinder API stating no space available for new volume.
> 
> I could not try this in my environment and fill up the cluster.
> 
> Please reply if you have ever tried and tested this.
> 
> Thank you.
> 
> Regards,
> M Bharath Krishna
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] Cinder + CEPH Storage Full Scenario

2015-10-19 Thread Bharath Krishna
I mean the cluster OSDs are physically full.

I understand it's not a pretty way to operate CEPH, allowing it to become full,
but I just wanted to know the boundary condition if it does become full.

Will the cinder create-volume operation create a new volume at all, or is an error
thrown at the Cinder API level itself stating that no space is available?

When IO stalls, will I be able to read data from the CEPH cluster, i.e. can
I still read data from existing volumes created on the CEPH cluster?

Thanks for the quick reply.

Regards
M Bharath Krishna

On 10/19/15, 2:51 PM, "Jan Schermer"  wrote:

>Do you mean when the CEPH cluster (OSDs) is physically full or when the
>quota is reached?
>
>If CEPH becomes full it just stalls all IO (maybe just write IO, but
>effectively same thing) - not pretty and you must never ever let it
>become full.
>
>Jan
>
>
>> On 19 Oct 2015, at 11:15, Bharath Krishna 
>>wrote:
>> 
>> Hi
>> 
>> What happens when Cinder service with CEPH backend storage cluster
>>capacity is FULL?
>> 
>> What would be the out come of new cinder create volume request?
>> 
>> Will volume be created with space not available for use or an error
>>thrown from Cinder API stating no space available for new volume.
>> 
>> I could not try this in my environment and fill up the cluster.
>> 
>> Please reply if you have ever tried and tested this.
>> 
>> Thank you.
>> 
>> Regards,
>> M Bharath Krishna
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



Re: [ceph-users] Cinder + CEPH Storage Full Scenario

2015-10-19 Thread Jan Schermer
It happened to me once, but I didn't really have any time to investigate how 
exactly it behaves. Some VMs had to be rebooted, other VMs survived, but I can't 
tell if, for example, rewriting the same block is possible.
Only writes should block in any case.

I don't know what happens to Cinder, but I don't expect it to work - it will 
either time out or fail with a 5xx error.

Jan

> On 19 Oct 2015, at 11:32, Bharath Krishna  wrote:
> 
> I mean cluster OSDs are physically full.
> 
> I understand its not a pretty way to operate CEPH allowing to become full,
> but I just wanted to know the boundary condition if it becomes full.
> 
> Will cinder create volume operation creates new volume at all or error is
> thrown at Cinder API level itself stating that no space available?
> 
> When IO stalls, will I be able to read the data from CEPH cluster I.e can
> I still read data from existing volumes created from CEPH cluster?
> 
> Thanks for the quick reply.
> 
> Regards
> M Bharath Krishna
> 
> On 10/19/15, 2:51 PM, "Jan Schermer"  wrote:
> 
>> Do you mean when the CEPH cluster (OSDs) is physically full or when the
>> quota is reached?
>> 
>> If CEPH becomes full it just stalls all IO (maybe just write IO, but
>> effectively same thing) - not pretty and you must never ever let it
>> become full.
>> 
>> Jan
>> 
>> 
>>> On 19 Oct 2015, at 11:15, Bharath Krishna 
>>> wrote:
>>> 
>>> Hi
>>> 
>>> What happens when Cinder service with CEPH backend storage cluster
>>> capacity is FULL?
>>> 
>>> What would be the out come of new cinder create volume request?
>>> 
>>> Will volume be created with space not available for use or an error
>>> thrown from Cinder API stating no space available for new volume.
>>> 
>>> I could not try this in my environment and fill up the cluster.
>>> 
>>> Please reply if you have ever tried and tested this.
>>> 
>>> Thank you.
>>> 
>>> Regards,
>>> M Bharath Krishna
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> 



Re: [ceph-users] Cinder + CEPH Storage Full Scenario

2015-10-19 Thread Bharath Krishna
Thanks Jan!!

Cheers
Bharath

On 10/19/15, 3:17 PM, "Jan Schermer"  wrote:

>It happened to me once but I didn't really have any time to investigate
>how exactly it behaves. Some VMs had to be rebooted, other VMs survived
>but I can't tell if for example rewriting the same block is possible.
>Only writes should block in any case.
>
>I don't know what happens to Cinder, but I don't expect it to work - it
>will either timeout or/then fail with 5xx error.
>
>Jan
>
>> On 19 Oct 2015, at 11:32, Bharath Krishna 
>>wrote:
>> 
>> I mean cluster OSDs are physically full.
>> 
>> I understand its not a pretty way to operate CEPH allowing to become
>>full,
>> but I just wanted to know the boundary condition if it becomes full.
>> 
>> Will cinder create volume operation creates new volume at all or error
>>is
>> thrown at Cinder API level itself stating that no space available?
>> 
>> When IO stalls, will I be able to read the data from CEPH cluster I.e
>>can
>> I still read data from existing volumes created from CEPH cluster?
>> 
>> Thanks for the quick reply.
>> 
>> Regards
>> M Bharath Krishna
>> 
>> On 10/19/15, 2:51 PM, "Jan Schermer"  wrote:
>> 
>>> Do you mean when the CEPH cluster (OSDs) is physically full or when the
>>> quota is reached?
>>> 
>>> If CEPH becomes full it just stalls all IO (maybe just write IO, but
>>> effectively same thing) - not pretty and you must never ever let it
>>> become full.
>>> 
>>> Jan
>>> 
>>> 
 On 19 Oct 2015, at 11:15, Bharath Krishna 
 wrote:
 
 Hi
 
 What happens when Cinder service with CEPH backend storage cluster
 capacity is FULL?
 
 What would be the out come of new cinder create volume request?
 
 Will volume be created with space not available for use or an error
 thrown from Cinder API stating no space available for new volume.
 
 I could not try this in my environment and fill up the cluster.
 
 Please reply if you have ever tried and tested this.
 
 Thank you.
 
 Regards,
 M Bharath Krishna
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> 
>> 
>



Re: [ceph-users] CephFS and page cache

2015-10-19 Thread John Spray
On Mon, Oct 19, 2015 at 8:59 AM, Burkhard Linke
 wrote:
> Hi,
>
> On 10/19/2015 05:27 AM, Yan, Zheng wrote:
>>
>> On Sat, Oct 17, 2015 at 1:42 AM, Burkhard Linke
>>  wrote:
>>>
>>> Hi,
>>>
>>> I've noticed that CephFS (both ceph-fuse and kernel client in version
>>> 4.2.3)
>>> remove files from page cache as soon as they are not in use by a process
>>> anymore.
>>>
>>> Is this intended behaviour? We use CephFS as a replacement for NFS in our
>>> HPC cluster. It should serve large files which are read by multiple jobs
>>> on
>>> multiple hosts, so keeping them in the page cache over the duration of
>>> several job invocations is crucial.
>>
>> Yes. MDS needs resource to track the cached data. We don't want MDS
>> use too much resource.
>>
>>> Mount options are defaults,noatime,_netdev (+ extra options for the
>>> kernel
>>> client). Is there an option to keep data in page cache just like any
>>> other
>>> filesystem?
>>
>> So far there is no option to do that. Later, we may add an option to
>> keep the cached data for a few seconds.
>
>
> This renders CephFS useless for almost any HPC cluster application. And
> keeping data for a few seconds is not a solution in most cases.

While I appreciate your frustration, that isn't an accurate statement.
For example, many physics HPC workloads use a network filesystem for
snapshotting their progress, where they dump their computed dataset at
regular intervals.  In these instances, having a cache of the data in
the pagecache is rarely if ever useful.

Moreover, in the general case of a shared filesystem with many nodes,
it is not to be assumed that the same client will be accessing the
same data repeatedly: there is an implicit hint in the use of a shared
filesystem that applications are likely to want to access that data
from different nodes, rather than the same node repeatedly.  Clearly
that is by no means true in all cases, but I think you may be
overestimating the generality of your own workload (not that we don't
want to make it faster for you)

> CephFS supports capabilities to manages access to objects, enforce
> consistency of data etc. IMHO a sane way to handle the page cache is use a
> capability to inform the mds about caches objects; as long as no other
> client claims write access to an object or its metadata, the cache copy is
> considered consistent. Upon write access the client should drop the
> capability (and thus remove the object from the page cache). If another
> process tries to access a cache object with intact 'cache' capability, it
> may be promoted to read/write capability.

This is essentially what we already do, except that we pro-actively
drop the capability when files are closed, rather than keeping it
around on the client in case it's needed again.

Having those caps linger on a client is a tradeoff:
 * while it makes subsequent cached reads from the original client
nice and fast, it adds latency for any other client that wants to open
the file.
 * It also adds latency for the original client when it wants to open
many other files, because it will have to wait for the original file's
capabilities to be given up before it has room in its metadata cache
to open other files.
 * it also creates confusion if someone opens a big file, then closes
it, then wonders why their ceph-fuse process is still sitting on gigs
of memory

Further, as Zheng pointed out, the design of cephfs requires that
whenever a client has capabilities on a file, it must also be in cache
on the MDSs.  Because there are many more clients than MDSs, clients
keeping comparatively modest numbers of capabilities can cause a much
more significant increase in the burden on the MDSs.  Even if this is
within the MDS cache limit, it still has the downside that it prevents
the MDS from caching other metadata that another client might want to
be using.

So: the key thing to realise is that caching behaviour is full of
tradeoffs, and this is really something that needs to be tunable, so
that it can be adapted to the differing needs of different workloads.
Having an optional "hold onto caps for N seconds after file close"
sounds like it would be the right tunable for your use case, right?

John

> I haven't dug into the details of either capabilities or kernel page cache,
> but the method described above should be very similar to the existing read
> only capability. I don't know whether there's a kind of eviction callback in
> the page cache that cephfs can use to update capabilities if an object is
> removed from the page cache (e.g. due to memory pressure), but I'm pretty
> sure that other filesystems like NFS also need to keep track of what's
> cached.
>
> This approach will probably increase the resources for both MDS and cephfs
> clients, but the benefits are obvious. For use cases with limited resource
> the MDS may refuse the 'cache' capability to client to reduce the memory
> footprint.
>
> Just my 2 ct and regards,
>
> Burkhard
> _

Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

2015-10-19 Thread John Spray
On Mon, Oct 19, 2015 at 8:55 AM, Jan Schermer  wrote:
> I understand this. But the clients can't request something that doesn't fit a 
> (POSIX) filesystem capabilities

Actually, clients can.  Clients can request fairly complex operations
like "read an xattr, stop if it's not there, now write the following
discontinuous regions of the file...".  RADOS executes these
transactions atomically.
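
As a rough sketch, such a compound operation could be expressed through the
librados C API along these lines (the pool, object and xattr names are made up
for the example; build with -lrados):

#include <rados/librados.h>
#include <stdio.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t io;

    if (rados_create(&cluster, NULL) < 0 ||
        rados_conf_read_file(cluster, NULL) < 0 ||
        rados_connect(cluster) < 0) {
        fprintf(stderr, "could not connect to cluster\n");
        return 1;
    }
    if (rados_ioctx_create(cluster, "rbd", &io) < 0) {
        fprintf(stderr, "could not open pool\n");
        rados_shutdown(cluster);
        return 1;
    }

    rados_write_op_t op = rados_write_op_create();
    /* Guard: abort the whole transaction unless the xattr has this value. */
    rados_write_op_cmpxattr(op, "myapp.state", LIBRADOS_CMPXATTR_OP_EQ, "ready", 5);
    /* Two discontiguous writes, applied atomically together with the guard. */
    rados_write_op_write(op, "AAAA", 4, 0);
    rados_write_op_write(op, "BBBB", 4, 4096);

    int r = rados_write_op_operate(op, io, "my-object", NULL, 0);
    printf("compound op returned %d\n", r);  /* e.g. -ECANCELED if the guard failed */

    rados_write_op_release(op);
    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    return r == 0 ? 0 : 1;
}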

However, you are correct that for many cases (new files, sequential
writes) it is possible to avoid the double write of data: the
in-development newstore backend does that.  But we still have cases
where we do fancier things than the backend (be it posix, or a KV
store) can handle, so we will have non-fast-path, higher-overhead ways of
handling it.
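
For context, the "double write" referred to above is the classic write-ahead
pattern Somnath described: make the whole transaction durable in the journal
first, apply it to the backing files afterwards, and only then trim it. A
purely conceptual sketch (nothing like FileStore's actual code):

/* 1) append the transaction to the journal and sync it,
 * 2) apply the individual operations to the backing file,
 * 3) only then trim the journal entry. The data hits the disk twice. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int journal_fd;

/* Step 1: the transaction is durable once this returns. */
static int journal_append(const void *txn, size_t len)
{
    if (write(journal_fd, txn, len) != (ssize_t)len)
        return -1;
    return fdatasync(journal_fd);
}

/* Step 2: replay the payload onto the backing object file. */
static int apply_to_backend(const char *path, const void *data, size_t len, off_t off)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    int rc = (pwrite(fd, data, len, off) == (ssize_t)len) ? fdatasync(fd) : -1;
    close(fd);
    return rc;
}

int main(void)
{
    journal_fd = open("journal.bin", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (journal_fd < 0) { perror("journal"); return 1; }

    /* Stand-in for a real encoded transaction (data + xattr + omap ops). */
    const char txn[] = "write obj.0001 @0: hello";
    if (journal_append(txn, sizeof txn) < 0) { perror("journal_append"); return 1; }
    if (apply_to_backend("obj.0001", "hello", 5, 0) < 0) { perror("apply"); return 1; }

    /* Step 3: "trim" -- here simply truncating the toy journal. */
    if (ftruncate(journal_fd, 0) < 0) { perror("trim"); return 1; }
    close(journal_fd);
    return 0;
}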

John

> That means the requests can map 1:1 into the filestore (O_FSYNC from
> client == O_FSYNC on the filestore object... ).
> Pagecache/io-schedulers are already smart enough to merge requests,
> preserve ordering - they just do the right thing already. It's true
> that in a distributed environment one async request can map to one OSD
> and then a synchronous one comes and needs the first one to be flushed
> beforehand, so that logic is presumably in place already - but I still
> don't see much need for a journal in there (btw in case of RBD with
> caching, this logic is probably not even needed at all and merging
> request in RBD cache makes more sense than merging somewhere down the
> line).
> It might be faster to merge small writes in journal when the journal is on 
> SSDs and filestore on spinning rust, but it will surely be slower (cpu bound 
> by ceph-osd?) when the filestore is fast enough or when the merging is not 
> optimal.
> I have never touched anything but a pure SSD cluster, though - I have always 
> been CPU bound and that's why I started thinking about this in the first 
> place. I'd love to have my disks saturated with requests from clients one day.
>
> Don't take this the wrong way, but I've been watching ceph perf talks and 
> stuff and haven't seen anything that would make Ceph comparably fast to an 
> ordinary SAN/NAS.
> Maybe this is a completely wrong idea, I just think it might be worth 
> thinking about.
>
> Thanks
>
> Jan
>
>
>> On 14 Oct 2015, at 20:29, Somnath Roy  wrote:
>>
>> FileSystem like XFS guarantees a single file write but in Ceph transaction 
>> we are touching file/xattrs/leveldb (omap), so no way filesystem can 
>> guarantee that transaction. That's why FileStore has implemented a 
>> write_ahead journal. Basically, it is writing the entire transaction object 
>> there and only trimming from journal when it is actually applied (all the 
>> operation executed) and persisted in the backend.
>>
>> Thanks & Regards
>> Somnath
>>
>> -Original Message-
>> From: Jan Schermer [mailto:j...@schermer.cz]
>> Sent: Wednesday, October 14, 2015 9:06 AM
>> To: Somnath Roy
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?
>>
>> But that's exactly what filesystems and their own journals do already :-)
>>
>> Jan
>>
>>> On 14 Oct 2015, at 17:02, Somnath Roy  wrote:
>>>
>>> Jan,
>>> Journal helps FileStore to maintain the transactional integrity in the 
>>> event of a crash. That's the main reason.
>>>
>>> Thanks & Regards
>>> Somnath
>>>
>>> -Original Message-
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
>>> Jan Schermer
>>> Sent: Wednesday, October 14, 2015 2:28 AM
>>> To: ceph-users@lists.ceph.com
>>> Subject: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?
>>>
>>> Hi,
>>> I've been thinking about this for a while now - does Ceph really need a 
>>> journal? Filesystems are already pretty good at committing data to disk 
>>> when asked (and much faster too), we have external journals in XFS and 
>>> Ext4...
>>> In a scenario where client does an ordinary write, there's no need to flush 
>>> it anywhere (the app didn't ask for it) so it ends up in pagecache and gets 
>>> committed eventually.
>>> If a client asks for the data to be flushed then fdatasync/fsync on the 
>>> filestore object takes care of that, including ordering and stuff.
>>> For reads, you just read from filestore (no need to differentiate between 
>>> filestore/journal) - pagecache gives you the right version already.
>>>
>>> Or is journal there to achieve some tiering for writes when the running 
>>> spindles with SSDs? This is IMO the only thing ordinary filesystems don't 
>>> do out of box even when filesystem journal is put on SSD - the data get 
>>> flushed to spindle whenever fsync-ed (even with data=journal). But in 
>>> reality, most of the data will hit the spindle either way and when you run 
>>> with SSDs it will always be much slower. And even for tiering - there are 
>>> already many options (bcache, flashcache or even ZFS L2ARC) that are much 
>>> more performant and proven stable. I think the fact that people  have a 
>>> need to combine Ceph with stuff like that al

[ceph-users] upgrading from 0.9.3 to 9.1.0 and systemd

2015-10-19 Thread Kenneth Waegeman

Hi all,

I tried upgrading ceph from 0.9.3 to 9.1.0, but ran into some trouble.
I chowned the /var/lib/ceph folder as described in the release notes, 
but my journal is on a separate partition, so I get:


Oct 19 11:58:59 ceph001.cubone.os systemd[1]: Started Ceph object 
storage daemon.
Oct 19 11:58:59 ceph001.cubone.os ceph-osd[6806]: starting osd.1 at :/0 
osd_data /var/lib/ceph/osd/ceph-1 /var/lib/ceph/osd/ceph-1/journal
Oct 19 11:58:59 ceph001.cubone.os ceph-osd[6806]: 2015-10-19 
11:58:59.530204 7f18aeba8900 -1 filestore(/var/lib/ceph/osd/ceph-1) 
mount failed to open journal /var/lib/ceph/osd/ceph-1/journal: (13) 
Permission den
Oct 19 11:58:59 ceph001.cubone.os ceph-osd[6806]: 2015-10-19 
11:58:59.540355 7f18aeba8900 -1 osd.1 0 OSD:init: unable to mount object 
store
Oct 19 11:58:59 ceph001.cubone.os ceph-osd[6806]: 2015-10-19 
11:58:59.540370 7f18aeba8900 -1  ** ERROR: osd init failed: (13) 
Permission denied
Oct 19 11:58:59 ceph001.cubone.os systemd[1]: ceph-osd@1.service: main 
process exited, code=exited, status=1/FAILURE


Is this a known issue?
I tried chowning the journal partition, without luck, then instead I get 
this:


Oct 19 12:10:34 ceph001.cubone.os ceph-osd[7763]: in thread 7fbb986fe900
Oct 19 12:10:34 ceph001.cubone.os ceph-osd[7763]: ceph version 9.1.0 
(3be81ae6cf17fcf689cd6f187c4615249fea4f61)
Oct 19 12:10:34 ceph001.cubone.os ceph-osd[7763]: 1: (()+0x7e1f22) 
[0x7fbb98ef1f22]
Oct 19 12:10:34 ceph001.cubone.os ceph-osd[7763]: 2: (()+0xf130) 
[0x7fbb97067130]
Oct 19 12:10:34 ceph001.cubone.os ceph-osd[7763]: 3: (gsignal()+0x37) 
[0x7fbb958255d7]
Oct 19 12:10:34 ceph001.cubone.os ceph-osd[7763]: 4: (abort()+0x148) 
[0x7fbb95826cc8]
Oct 19 12:10:34 ceph001.cubone.os ceph-osd[7763]: 5: 
(__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fbb961389b5]
Oct 19 12:10:34 ceph001.cubone.os ceph-osd[7763]: 6: (()+0x5e926) 
[0x7fbb96136926]
Oct 19 12:10:34 ceph001.cubone.os ceph-osd[7763]: 7: (()+0x5e953) 
[0x7fbb96136953]
Oct 19 12:10:34 ceph001.cubone.os ceph-osd[7763]: 8: (()+0x5eb73) 
[0x7fbb96136b73]
Oct 19 12:10:34 ceph001.cubone.os ceph-osd[7763]: 9: 
(ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x27a) [0x7fbb98fe766a]
Oct 19 12:10:34 ceph001.cubone.os ceph-osd[7763]: 10: 
(OSDService::get_map(unsigned int)+0x3d) [0x7fbb98a97e2d]
Oct 19 12:10:34 ceph001.cubone.os ceph-osd[7763]: 11: 
(OSD::init()+0xb0b) [0x7fbb98a4bf7b]
Oct 19 12:10:34 ceph001.cubone.os ceph-osd[7763]: 12: (main()+0x2998) 
[0x7fbb989cf3b8]
Oct 19 12:10:34 ceph001.cubone.os ceph-osd[7763]: 13: 
(__libc_start_main()+0xf5) [0x7fbb95811af5]
Oct 19 12:10:34 ceph001.cubone.os ceph-osd[7763]: 14: (()+0x2efb49) 
[0x7fbb989ffb49]
Oct 19 12:10:34 ceph001.cubone.os ceph-osd[7763]: NOTE: a copy of the 
executable, or `objdump -rdS ` is needed to interpret this.
Oct 19 12:10:34 ceph001.cubone.os ceph-osd[7763]: 0> 2015-10-19 
12:10:34.710385 7fbb986fe900 -1 *** Caught signal (Aborted) **


So the OSDs do not start.

By the way, is there an easy way to restart only the OSDs, and not the mons or 
other daemons, as with ceph.target?

Could there be separate targets for the osd/mon/... types?

Thanks!

Kenneth


[ceph-users] nhm ceph is down

2015-10-19 Thread Iezzi, Federico
Hi there,



The content shared at http://nhm.ceph.com/ is no longer reachable on the 
Internet.

Could you please fix it?



Thanks,

F.



Re: [ceph-users] CephFS and page cache

2015-10-19 Thread Dan van der Ster
On Mon, Oct 19, 2015 at 12:34 PM, John Spray  wrote:
> On Mon, Oct 19, 2015 at 8:59 AM, Burkhard Linke
>  wrote:
>> Hi,
>>
>> On 10/19/2015 05:27 AM, Yan, Zheng wrote:
>>>
>>> On Sat, Oct 17, 2015 at 1:42 AM, Burkhard Linke
>>>  wrote:

 Hi,

 I've noticed that CephFS (both ceph-fuse and kernel client in version
 4.2.3)
 remove files from page cache as soon as they are not in use by a process
 anymore.

 Is this intended behaviour? We use CephFS as a replacement for NFS in our
 HPC cluster. It should serve large files which are read by multiple jobs
 on
 multiple hosts, so keeping them in the page cache over the duration of
 several job invocations is crucial.
>>>
>>> Yes. MDS needs resource to track the cached data. We don't want MDS
>>> use too much resource.
>>>
 Mount options are defaults,noatime,_netdev (+ extra options for the
 kernel
 client). Is there an option to keep data in page cache just like any
 other
 filesystem?
>>>
>>> So far there is no option to do that. Later, we may add an option to
>>> keep the cached data for a few seconds.
>>
>>
>> This renders CephFS useless for almost any HPC cluster application. And
>> keeping data for a few seconds is not a solution in most cases.
>
> While I appreciate your frustration, that isn't an accurate statement.
> For example, many physics HPC workloads use a network filesystem for
> snapshotting their progress, where they dump their computed dataset at
> regular intervals.  In these instances, having a cache of the data in
> the pagecache is rarely if ever useful.
>
> Moreover, in the general case of a shared filesystem with many nodes,
> it is not to be assumed that the same client will be accessing the
> same data repeatedly: there is an implicit hint in the use of a shared
> filesystem that applications are likely to want to access that data
> from different nodes, rather than the same node repeatedly.  Clearly
> that is by no means true in all cases, but I think you may be
> overestimating the generality of your own workload (not that we don't
> want to make it faster for you)
>

Your assumption doesn't match what I've seen (in high energy physics
(HEP)). The implicit hint you describe is much more apparent when
clients use object storage APIs like S3 or one of the oodles of
network storage systems we use in high energy physics. But NFS-like
shared filesystems are different. This is where we'll put
applications, libraries, configurations, configuration _data_ -- all
things which indeed _are_ likely to be re-used by the same client many
times. Consider these use-cases: a physicist is developing an analysis
which is linked against 100's of headers in CephFS, recompiling many
times, and also 100's of other users doing the same with the same
headers; or a batch processing node is running the same data analysis
code (hundreds/thousands of libraries in CephFS) on different input
files.

Files are re-accessed so often in HEP that we developed a new
immutable-only, cache-forever filesystem for application distribution
(CVMFS). And in places where we use OpenAFS we make use of readonly
replicas to ensure that clients can cache as often as possible.

>> CephFS supports capabilities to manages access to objects, enforce
>> consistency of data etc. IMHO a sane way to handle the page cache is use a
>> capability to inform the mds about caches objects; as long as no other
>> client claims write access to an object or its metadata, the cache copy is
>> considered consistent. Upon write access the client should drop the
>> capability (and thus remove the object from the page cache). If another
>> process tries to access a cache object with intact 'cache' capability, it
>> may be promoted to read/write capability.
>
> This is essentially what we already do, except that we pro-actively
> drop the capability when files are closed, rather than keeping it
> around on the client in case its needed again.
>
> Having those caps linger on a client is a tradeoff:
>  * while it makes subsequent cached reads from the original client
> nice and fast, it adds latency for any other client that wants to open
> the file.
>  * It also adds latency for the original client when it wants to open
> many other files, because it will have to wait for the original file's
> capabilities to be given up before it has room in its metadata cache
> to open other files.
>  * it also creates confusion if someone opens a big file, then closes
> it, then wonders why their ceph-fuse process is still sitting on gigs
> of memory
>
> Further, as Zheng pointed out, the design of cephfs requires that
> whenever a client has capabilities on a file, it must also be in cache
> on the MDSs.  Because there are many more clients than MDSs, clients
> keeping comparatively modest numbers of capabilities can cause an much
> more significant increase in the burden on the MDSs.  Even if this is
> within the MDS cache limit, it still has the do

Re: [ceph-users] CephFS and page cache

2015-10-19 Thread Burkhard Linke

Hi,

On 10/19/2015 12:34 PM, John Spray wrote:

On Mon, Oct 19, 2015 at 8:59 AM, Burkhard Linke
 wrote:

Hi,

On 10/19/2015 05:27 AM, Yan, Zheng wrote:

On Sat, Oct 17, 2015 at 1:42 AM, Burkhard Linke
 wrote:

Hi,

I've noticed that CephFS (both ceph-fuse and kernel client in version
4.2.3)
remove files from page cache as soon as they are not in use by a process
anymore.

Is this intended behaviour? We use CephFS as a replacement for NFS in our
HPC cluster. It should serve large files which are read by multiple jobs
on
multiple hosts, so keeping them in the page cache over the duration of
several job invocations is crucial.

Yes. MDS needs resource to track the cached data. We don't want MDS
use too much resource.


Mount options are defaults,noatime,_netdev (+ extra options for the
kernel
client). Is there an option to keep data in page cache just like any
other
filesystem?

So far there is no option to do that. Later, we may add an option to
keep the cached data for a few seconds.


This renders CephFS useless for almost any HPC cluster application. And
keeping data for a few seconds is not a solution in most cases.

While I appreciate your frustration, that isn't an accurate statement.
For example, many physics HPC workloads use a network filesystem for
snapshotting their progress, where they dump their computed dataset at
regular intervals.  In these instances, having a cache of the data in
the pagecache is rarely if ever useful.
I completely agree. HPC workloads differ depending on your field, 
and even within a certain field the workloads may vary. The examples 
mentioned in another mail are just that: examples. We also have other 
applications and other workloads. Traditional HPC clusters used to be 
isolated with respect to both compute nodes and storage; access was 
possible via a head node and maybe some NFS server. In our setup, compute 
and storage are more integrated into the users' environment. I think the 
traditional model is becoming extinct in our field, given all the new 
developments in the last 15 years.




Moreover, in the general case of a shared filesystem with many nodes,
it is not to be assumed that the same client will be accessing the
same data repeatedly: there is an implicit hint in the use of a shared
filesystem that applications are likely to want to access that data
from different nodes, rather than the same node repeatedly.  Clearly
that is by no means true in all cases, but I think you may be
overestimating the generality of your own workload (not that we don't
want to make it faster for you)
As mentioned above, CephFS is not restricted to our cluster hosts. It is 
also available on interactive compute machines and even on desktops. And 
on these machines users expect data to be present in the cache if they 
want to start a computation a second time, e.g. after adjusting some 
parameters. I don't mind file access being slow on the batch machines, 
but our users do mind slow access in their day-to-day work.



CephFS supports capabilities to manages access to objects, enforce
consistency of data etc. IMHO a sane way to handle the page cache is use a
capability to inform the mds about caches objects; as long as no other
client claims write access to an object or its metadata, the cache copy is
considered consistent. Upon write access the client should drop the
capability (and thus remove the object from the page cache). If another
process tries to access a cache object with intact 'cache' capability, it
may be promoted to read/write capability.

This is essentially what we already do, except that we pro-actively
drop the capability when files are closed, rather than keeping it
around on the client in case its needed again.

Having those caps linger on a client is a tradeoff:
  * while it makes subsequent cached reads from the original client
nice and fast, it adds latency for any other client that wants to open
the file.
I assume the same is also true with the current situation, if the file 
is already opened by another client.

  * It also adds latency for the original client when it wants to open
many other files, because it will have to wait for the original file's
capabilities to be given up before it has room in its metadata cache
to open other files.
  * it also creates confusion if someone opens a big file, then closes
it, then wonders why their ceph-fuse process is still sitting on gigs
of memory
I agree on that. ceph-fuse processes already become way too large in my 
opinion:


  PID USER     PR  NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
  902 root     20   0 3045056 1.680g  4328 S   0.0 21.5 338:23.78 ceph-fuse


(and that's just a web server with some perl cgi stuff)

But the data itself should be stored in the page cache (dunno whether a 
fuse process can actually push data to the page cache).


Further, as Zheng pointed out, the design of cephfs requires that
whenever a client has capabilities on a file, it must also be in cache
on the MDS

Re: [ceph-users] CephFS and page cache

2015-10-19 Thread John Spray
On Mon, Oct 19, 2015 at 12:52 PM, Dan van der Ster  wrote:
>> So: the key thing to realise is that caching behaviour is full of
>> tradeoffs, and this is really something that needs to be tunable, so
>> that it can be adapted to the differing needs of different workloads.
>> Having an optional "hold onto caps for N seconds after file close"
>> sounds like it would be the right tunable for your use case, right?
>>
>
> I think that would help. Caching is pretty essential so we'd buy more
> MDS's and loads of RAM if CephFS became a central part of our
> infrastructure.
>
> But looking forward, if CephFS could support the immutable bit --
> chattr +i  -- then maybe the MDS wouldn't need to track clients
> who have such files cached. (Immutable files would be useful for other
> reasons too, like archiving!)

Hmm, if we didn't know which clients were accessing a +i file, then we
would have to do a global broadcast when we removed the attribute
(typically to delete the file), to tell all the clients to drop the
file from cache, and wait for a message from every client indicating
that they had dropped it.

For example, if you had an immutable /usr/bin/foo, and someone had it
in their cache but the MDS didn't know they had it in their cache,
then when you upgraded that package (replaced the binary file with
another), the MDS wouldn't know who to tell that they should no longer
use the old instance of the file.

More generally though, there probably are good cases where we should
do different caching behaviour on +i files, I hadn't really thought
about it before.
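
For reference, the immutable bit being discussed is the one `chattr +i` sets
through the FS_IOC_SETFLAGS ioctl on local filesystems (CephFS does not
support it today, which is exactly the wish); a minimal sketch of what setting
it involves:

/* Set the immutable flag on a file, as chattr +i does on local filesystems.
 * Requires CAP_LINUX_IMMUTABLE and a filesystem that supports the flag. */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    int attr = 0;
    if (ioctl(fd, FS_IOC_GETFLAGS, &attr) < 0) { perror("FS_IOC_GETFLAGS"); return 1; }

    attr |= FS_IMMUTABLE_FL;
    if (ioctl(fd, FS_IOC_SETFLAGS, &attr) < 0) { perror("FS_IOC_SETFLAGS"); return 1; }

    close(fd);
    return 0;
}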

John


Re: [ceph-users] CephFS and page cache

2015-10-19 Thread John Spray
On Mon, Oct 19, 2015 at 12:52 PM, Dan van der Ster  wrote:
> Your assumption doesn't match what I've seen (in high energy physics
> (HEP)). The implicit hint you describe is much more apparent when
> clients use object storage APIs like S3 or one of the oodles of
> network storage systems we use in high energy physics. But NFS-like
> shared filesystems are different. This is where we'll put
> applications, libraries, configurations, configuration _data_ -- all
> things which indeed _are_ likely to be re-used by the same client many
> times. Consider these use-cases: a physicist is developing an analysis
> which is linked against 100's of headers in CephFS, recompiling many
> times, and also 100's of other users doing the same with the same
> headers; or a batch processing node is running the same data analysis
> code (hundreds/thousands of libraries in CephFS) on different input
> files.
>
> Files are re-accessed so often in HEP that we developed a new
> immutable-only, cache-forever filesystem for application distribution
> (CVMFS). And in places where we use OpenAFS we make use of readonly
> replicas to ensure that clients can cache as often as possible.

You've caught me being a bit optimistic:-)  In my idealistic version
of reality, people would only use shared filesystems for truly shared
data, rather than things like headers and libraries, but in the real
world it isn't so (you are correct).

It's a difficult situation: for example, I remember a case where
someone wanted to improve small file open performance on Lustre,
because they were putting their python site-packages directory in
Lustre, and then starting thousands of processes that all wanted to
scan it at the same time on startup.  The right answer was "please
don't do that!  Just use a local copy of site-packages on each node,
or put it on a small SSD filer", but in practice it is much more
convenient for people to have a single global filesystem.

Other examples spring to mind:
 * Home directories where someone's browser cache ends up on a
triply-redundant distributed filesystem (it's completely disposable
data!  This is so wasteful!)
 * Small file create workloads from compilations (put your .o files in
/tmp, not on a million dollar storage cluster!)

These are arguably workloads that just shouldn't be on a distributed
filesystem to begin with, but unfortunately developers do not always
get to choose the workloads that people will run :-)

In the future, there could be scope for doing interesting things with
layouts to support some radically different policies, e.g. having
"read replica" directories that do really coarse-grained caching and
rely on a global broadcast to do invalidation.  The trouble is that
these things are a lot of work to implement, and they still rely on
the user to remember to set the right flags on the right directories.
It would be pretty interesting though to have e.g. an intern spend
some time coming up with a caching policy that worked super-well for
your use case, so that we had a better idea of how much work it would
really be.  A project like this could be something like taking the
CVMFS/OpenAFS behaviours that you like, and building them into CephFS
as optional modes.

John


Re: [ceph-users] [Ceph-community] Cephx vs. Kerberos

2015-10-19 Thread Joao Eduardo Luis
CC-ing ceph-users where this message belongs.


On 10/16/2015 05:41 PM, Michael Joy wrote:
> Hey Everyone,
> 
> Is it possible to use Kerberos for authentication instead of the built-in
> Cephx? Does anyone know the process to get it working if it is possible?

No, but it is on the wishlist for Jewel. Let's see how much progress is
made until then.

  -Joao



Re: [ceph-users] CephFS and page cache

2015-10-19 Thread Stijn De Weirdt
>>> So: the key thing to realise is that caching behaviour is full of
>>> tradeoffs, and this is really something that needs to be tunable, so
>>> that it can be adapted to the differing needs of different workloads.
>>> Having an optional "hold onto caps for N seconds after file close"
>>> sounds like it would be the right tunable for your use case, right?
>>>
>>
>> I think that would help. Caching is pretty essential so we'd buy more
>> MDS's and loads of RAM if CephFS became a central part of our
>> infrastructure.
>>
>> But looking forward, if CephFS could support the immutable bit --
>> chattr +i  -- then maybe the MDS wouldn't need to track clients
>> who have such files cached. (Immutable files would be useful for other
>> reasons too, like archiving!)
> 
> Hmm, if we didn't know which clients were accessing a +i file, then we
> would have to do a global broadcast when we removed the attribute
> (typically to delete the file), to tell all the clients to drop the
> file from cache, and wait for a message from every client indicating
> that they had dropped it.
> 
> For example, if you had an immutable /usr/bin/foo, and someone had it
> in their cache but the MDS didn't know they had it in their cache,
> then when you upgraded that package (replaced the binary file with
> another), the MDS wouldn't know who to tell that they should no longer
> use the old instance of the file.
> 
> More generally though, there probably are good cases where we should
> do different caching behaviour on +i files, I hadn't really thought
> about it before.
if we are making a wishlist, how about read-cache support for CephFS
snapshots?

stijn

> 
> John
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 


Re: [ceph-users] nhm ceph is down

2015-10-19 Thread Mark Nelson

Hi,

It got taken down when there was that security issue on ceph.com a 
couple of weeks back.  I'll bug the website admins again about getting 
it back up.


Mark

On 10/19/2015 06:13 AM, Iezzi, Federico wrote:

Hi there,

The content sharing at http://nhm.ceph.com/ is not anymore reachable on
Internet.

Could you please fix it?

Thanks,

F.





Re: [ceph-users] Cinder + CEPH Storage Full Scenario

2015-10-19 Thread LOPEZ Jean-Charles
Hi,

when an OSD gets full, any write operation to the entire cluster will be 
disabled.

As a result, creating a single RBD will become impossible and all VMs that need 
to write to one of their Ceph-backed RBDs will suffer the same pain.

Usually, this ends up as a bad story for the VMs.

The best practice is to monitor the disk space usage of the OSDs; as a 
matter of fact RHCS 1.x includes a ceph df command to do this. You can also 
use the output of the ceph report command to grab the appropriate info to 
compute it, or rely on external SNMP monitoring tools to grab the usage details 
of the particular OSD disk drives.
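
Along the same lines, overall usage can also be polled programmatically; a
minimal sketch against the librados C API (the 85%/95% thresholds are only the
usual nearfull/full defaults and are assumptions here; note that Ceph actually
flags 'full' based on the fullest single OSD, so per-OSD checks are still
needed):

/* Poll raw cluster usage and warn as it approaches the assumed
 * nearfull/full thresholds. Build with: cc check_usage.c -lrados */
#include <rados/librados.h>
#include <stdio.h>

int main(void)
{
    rados_t cluster;
    if (rados_create(&cluster, NULL) < 0 ||
        rados_conf_read_file(cluster, NULL) < 0 ||
        rados_connect(cluster) < 0) {
        fprintf(stderr, "could not connect to cluster\n");
        return 2;
    }

    struct rados_cluster_stat_t st;
    if (rados_cluster_stat(cluster, &st) < 0) {
        fprintf(stderr, "cluster stat failed\n");
        rados_shutdown(cluster);
        return 2;
    }

    double used = (double)st.kb_used / (double)st.kb;
    printf("raw usage: %.1f%% (%llu/%llu KB)\n", used * 100.0,
           (unsigned long long)st.kb_used, (unsigned long long)st.kb);

    int rc = 0;
    if (used >= 0.95)      { printf("FULL - writes will stall\n"); rc = 1; }
    else if (used >= 0.85) { printf("nearfull - act now\n");       rc = 1; }

    rados_shutdown(cluster);
    return rc;
}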

Have a great day.
JC

> On Oct 19, 2015, at 02:32, Bharath Krishna  wrote:
> 
> I mean cluster OSDs are physically full.
> 
> I understand its not a pretty way to operate CEPH allowing to become full,
> but I just wanted to know the boundary condition if it becomes full.
> 
> Will cinder create volume operation creates new volume at all or error is
> thrown at Cinder API level itself stating that no space available?
> 
> When IO stalls, will I be able to read the data from CEPH cluster I.e can
> I still read data from existing volumes created from CEPH cluster?
> 
> Thanks for the quick reply.
> 
> Regards
> M Bharath Krishna
> 
> On 10/19/15, 2:51 PM, "Jan Schermer"  wrote:
> 
>> Do you mean when the CEPH cluster (OSDs) is physically full or when the
>> quota is reached?
>> 
>> If CEPH becomes full it just stalls all IO (maybe just write IO, but
>> effectively same thing) - not pretty and you must never ever let it
>> become full.
>> 
>> Jan
>> 
>> 
>>> On 19 Oct 2015, at 11:15, Bharath Krishna 
>>> wrote:
>>> 
>>> Hi
>>> 
>>> What happens when Cinder service with CEPH backend storage cluster
>>> capacity is FULL?
>>> 
>>> What would be the out come of new cinder create volume request?
>>> 
>>> Will volume be created with space not available for use or an error
>>> thrown from Cinder API stating no space available for new volume.
>>> 
>>> I could not try this in my environment and fill up the cluster.
>>> 
>>> Please reply if you have ever tried and tested this.
>>> 
>>> Thank you.
>>> 
>>> Regards,
>>> M Bharath Krishna
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] CephFS and page cache

2015-10-19 Thread Brian Kroth

John Spray  2015-10-19 11:34:


CephFS supports capabilities to manages access to objects, enforce
consistency of data etc. IMHO a sane way to handle the page cache is use a
capability to inform the mds about caches objects; as long as no other
client claims write access to an object or its metadata, the cache copy is
considered consistent. Upon write access the client should drop the
capability (and thus remove the object from the page cache). If another
process tries to access a cache object with intact 'cache' capability, it
may be promoted to read/write capability.


This is essentially what we already do, except that we pro-actively
drop the capability when files are closed, rather than keeping it
around on the client in case its needed again.

Having those caps linger on a client is a tradeoff:
* while it makes subsequent cached reads from the original client
nice and fast, it adds latency for any other client that wants to open
the file.
* It also adds latency for the original client when it wants to open
many other files, because it will have to wait for the original file's
capabilities to be given up before it has room in its metadata cache
to open other files.
* it also creates confusion if someone opens a big file, then closes
it, then wonders why their ceph-fuse process is still sitting on gigs
of memory


This doesn't really address any of the cache coherency issues raised on 
the thread, but I wonder if hooking into something like the cachefs 
facility would alleviate the memory concerns raised above?


The idea would basically be to just store the cached cephfs blocks on a 
local fs and let the regular cachefs/localfs facilities handle keeping 
those pages in the pagecache or not as necessary.  In the case that a 
file is reopened on a client later on, it still would need to 
reestablish its cache validity with the server(s) but then could 
(hopefully) just get the data from the local cachefs which (hopefully) 
may also still have it in the local pagecache.  If not, all you've 
done is a few local OS calls which shouldn't be too bad compared to the 
network IO required to fetch the full file contents again from the 
server(s).  That seems too simple so there must be another wrinkle in 
there somewhere, right?


The only thread I was able to find on something like that in the past 
appears to have died in kernel panics [1].


Thanks,
Brian

[1] 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Keystone RADOSGW ACLs

2015-10-19 Thread Will . Boege
I'm working with some teams who would like not only to create ACLs within 
RADOSGW at the tenant level, but also to tailor ACLs to individual users within 
that tenant.  After trial and error, I can only seem to get ACLs to stick at the 
tenant level, using the Keystone tenant ID UUID.

Is this expected behavior for RadosGW?  Can you only assign bucket ACLs at the 
tenant level with Keystone auth?  There doesn't seem to be a lot of doco out 
there around RadosGW with Keystone auth and its implications.
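
To make it concrete, this is roughly what we are trying with boto (v2); the 
endpoint, credentials, bucket name and tenant UUID below are placeholders, not 
real values:

    import boto
    import boto.s3.connection

    # Hypothetical RGW endpoint and Keystone EC2 credentials
    conn = boto.connect_s3(
        aws_access_key_id='EC2_ACCESS_KEY',
        aws_secret_access_key='EC2_SECRET_KEY',
        host='rgw.example.com',
        is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat())

    bucket = conn.get_bucket('shared-bucket')
    # Granting READ to the tenant UUID is the only thing that sticks for us;
    # granting to an individual user inside that tenant is what we haven't
    # managed to make work.
    bucket.add_user_grant('READ', 'd2717558a5a14301a2fd02c24fbc9d2c')
    print(bucket.get_acl())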

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

2015-10-19 Thread James (Fei) Liu-SSI
Hi John,
Thanks for your explanations.

Actually, clients can.  Clients can request fairly complex operations like 
"read an xattr, stop if it's not there, now write the following discontinuous 
regions of the file...".  RADOS executes these transactions atomically.
[James]  Would you mind detailing a little bit more the operations in 
RADOS transactions?  Is there any limit on the number of ops in one RADOS 
transaction? What if we came up with similar transaction capabilities in either a 
new file system or a key-value store, to map what a RADOS transaction provides?  If 
we could come up with a solution like what Jan proposed, a 1:1 mapping for transactions 
between the filesystem/key-value store and RADOS, we wouldn't need journaling in the 
objectstore, which would dramatically improve the performance of Ceph.
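
To make the question concrete, this is roughly how I picture a compound 
operation from the client side with the librados Python bindings (just a 
sketch: the pool, object and omap key names are made up, and it assumes a 
recent python-rados where WriteOpCtx works as a context manager):

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')
    try:
        # Everything queued on the write op is applied atomically by the
        # primary OSD: either all of it becomes visible, or none of it.
        with rados.WriteOpCtx() as op:
            ioctx.set_omap(op, ('last_update',), (b'2015-10-19',))
            ioctx.operate_write_op(op, 'my_object')
    finally:
        ioctx.close()
        cluster.shutdown()

The C API exposes richer conditional steps (e.g. rados_write_op_cmpxattr 
followed by rados_write_op_write) for the "check an xattr, then write 
discontiguous extents" case John describes.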

Thanks.

Regards,
James

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of John 
Spray
Sent: Monday, October 19, 2015 3:44 AM
To: Jan Schermer
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

On Mon, Oct 19, 2015 at 8:55 AM, Jan Schermer  wrote:
> I understand this. But the clients can't request something that 
> doesn't fit a (POSIX) filesystem capabilities

Actually, clients can.  Clients can request fairly complex operations like 
"read an xattr, stop if it's not there, now write the following discontinuous 
regions of the file...".  RADOS executes these transactions atomically.

However, you are correct that for many cases (new files, sequential
writes) it is possible to avoid the double write of data: the in-development 
newstore backend does that.  But we still have cases where we do fancier things 
than the backend (be it posix, or a KV
store) can handle, so will have non-fast-path higher overhead ways of handling 
it.

John

That means the requests can map 1:1 into the filestore (O_FSYNC from client == 
O_FSYNC on the filestore object... ).
Pagecache/io-schedulers are already smart enough to merge requests, preserve 
ordering - they just do the right thing already. It's true that in a 
distributed environment one async request can map to one OSD and then a 
synchronous one comes and needs the first one to be flushed beforehand, so that 
logic is presumably in place already - but I still don't see much need for a 
journal in there (btw in case of RBD with caching, this logic is probably not 
even needed at all and merging request in RBD cache makes more sense than 
merging somewhere down the line).
> It might be faster to merge small writes in journal when the journal is on 
> SSDs and filestore on spinning rust, but it will surely be slower (cpu bound 
> by ceph-osd?) when the filestore is fast enough or when the merging is not 
> optimal.
> I have never touched anything but a pure SSD cluster, though - I have always 
> been CPU bound and that's why I started thinking about this in the first 
> place. I'd love to have my disks saturated with requests from clients one day.
>
> Don't take this the wrong way, but I've been watching ceph perf talks and 
> stuff and haven't seen anything that would make Ceph comparably fast to an 
> ordinary SAN/NAS.
> Maybe this is a completely wrong idea, I just think it might be worth 
> thinking about.
>
> Thanks
>
> Jan
>
>
>> On 14 Oct 2015, at 20:29, Somnath Roy  wrote:
>>
>> FileSystem like XFS guarantees a single file write but in Ceph transaction 
>> we are touching file/xattrs/leveldb (omap), so no way filesystem can 
>> guarantee that transaction. That's why FileStore has implemented a 
>> write_ahead journal. Basically, it is writing the entire transaction object 
>> there and only trimming from journal when it is actually applied (all the 
>> operation executed) and persisted in the backend.
>>
>> Thanks & Regards
>> Somnath
>>
>> -Original Message-
>> From: Jan Schermer [mailto:j...@schermer.cz]
>> Sent: Wednesday, October 14, 2015 9:06 AM
>> To: Somnath Roy
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?
>>
>> But that's exactly what filesystems and their own journals do already 
>> :-)
>>
>> Jan
>>
>>> On 14 Oct 2015, at 17:02, Somnath Roy  wrote:
>>>
>>> Jan,
>>> Journal helps FileStore to maintain the transactional integrity in the 
>>> event of a crash. That's the main reason.
>>>
>>> Thanks & Regards
>>> Somnath
>>>
>>> -Original Message-
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On 
>>> Behalf Of Jan Schermer
>>> Sent: Wednesday, October 14, 2015 2:28 AM
>>> To: ceph-users@lists.ceph.com
>>> Subject: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?
>>>
>>> Hi,
>>> I've been thinking about this for a while now - does Ceph really need a 
>>> journal? Filesystems are already pretty good at committing data to disk 
>>> when asked (and much faster too), we have external journals in XFS and 
>>> Ext4...
>>> In a scenario where client does an 

Re: [ceph-users] Cinder + CEPH Storage Full Scenario

2015-10-19 Thread Andrew Woodward
Cinder will periodically inspect the free space of the volume services and
use this data when determining which one to schedule to when a request is
received. In this case the cinder volume create request may error out during
scheduling. You may also see an error when instantiating a volume from an
image if it passes the prior check but then runs out of space while writing
the image to the volume.

I'm not sure if it's still the case, but in Havana (and I see no reason for it
to have changed) the free space check in cinder didn't account for the
difference between promised space (the maximum size of the volumes already
assigned) and actual usage; instead it would literally look at the free space
in the output of `rados df`.

As noted above, if the cluster gets to "100%" used, bad things will happen
to your VMs. The most likely case is that they all end up with read-only
filesystems. (100% is a misnomer, as there is a configured max % at which the
cluster will stop accepting data writes, to ensure that important object
replication / maintenance can still occur and the cluster does not fall over.)
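
To put numbers on that, the naive cluster-wide check boils down to something
like this python-rados sketch (the conffile path and the 85% alert threshold
are just illustrative; the configured full ratio where writes stop defaults to
95%):

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        stats = cluster.get_cluster_stats()  # kb, kb_used, kb_avail, num_objects
        used = float(stats['kb_used']) / stats['kb']
        # Alert well before the full ratio is reached, because writes will
        # simply stop once any OSD crosses it.
        if used > 0.85:
            print('cluster is %.1f%% used - act before writes stop' % (used * 100))
    finally:
        cluster.shutdown()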

On Mon, Oct 19, 2015 at 7:51 AM LOPEZ Jean-Charles 
wrote:

> Hi,
>
> when an OSD gets full, any write operation to the entire cluster will be
> disabled.
>
> As a result, creating a single RBD will become impossible and all VMs that
> need to write to one of their Ceph-backed RBDs will suffer the same pain.
>
> Usually, this ends up as a bad story for the VMs.
>
> The best practice is to monitor the disk space usage for the OSDs and as a
> matter of fact RHCS 1.# includes a ceph osd df command to do this. You can
> also use the output of the ceph report command to grab the appropriate
> info to compute it or rely on external SNMP monitoring tools to grab the
> usage details of the particular OSD disk drives.
>
> Have a great day.
> JC
>
> > On Oct 19, 2015, at 02:32, Bharath Krishna 
> wrote:
> >
> > I mean cluster OSDs are physically full.
> >
> > I understand its not a pretty way to operate CEPH allowing to become
> full,
> > but I just wanted to know the boundary condition if it becomes full.
> >
> > Will cinder create volume operation creates new volume at all or error is
> > thrown at Cinder API level itself stating that no space available?
> >
> > When IO stalls, will I be able to read the data from CEPH cluster I.e can
> > I still read data from existing volumes created from CEPH cluster?
> >
> > Thanks for the quick reply.
> >
> > Regards
> > M Bharath Krishna
> >
> > On 10/19/15, 2:51 PM, "Jan Schermer"  wrote:
> >
> >> Do you mean when the CEPH cluster (OSDs) is physically full or when the
> >> quota is reached?
> >>
> >> If CEPH becomes full it just stalls all IO (maybe just write IO, but
> >> effectively same thing) - not pretty and you must never ever let it
> >> become full.
> >>
> >> Jan
> >>
> >>
> >>> On 19 Oct 2015, at 11:15, Bharath Krishna 
> >>> wrote:
> >>>
> >>> Hi
> >>>
> >>> What happens when Cinder service with CEPH backend storage cluster
> >>> capacity is FULL?
> >>>
> >>> What would be the out come of new cinder create volume request?
> >>>
> >>> Will volume be created with space not available for use or an error
> >>> thrown from Cinder API stating no space available for new volume.
> >>>
> >>> I could not try this in my environment and fill up the cluster.
> >>>
> >>> Please reply if you have ever tried and tested this.
> >>>
> >>> Thank you.
> >>>
> >>> Regards,
> >>> M Bharath Krishna
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
-- 

--

Andrew Woodward

Mirantis

Fuel Community Ambassador

Ceph Community
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

2015-10-19 Thread Jan Schermer
I'm sorry for appearing a bit dull (on purpose), I was hoping I'd hear what 
other people using Ceph think.

If I were to use RADOS directly in my app I'd probably rejoice at its 
capabilities and how useful and non-legacy it is, but my use is basically for 
RBD volumes with OpenStack (libvirt, qemu...). And for that those capabilities 
are unneeded.
I live in this RBD bubble so that's all I know, but isn't this also the only 
usage pattern that 90% (or more) of people using Ceph care about? Isn't this what 
drives Ceph adoption in *Stack? Yet, isn't the biggest PITA when it comes to 
displacing traditional (DAS, SAN, NAS) solutions the overhead (=complexity) of 
Ceph?*

What are the apps that actually use the RADOS features? I know Swift has some 
RADOS backend (which does the same thing Swift already did by itself, maybe 
with stronger consistency?), RGW (which basically does the same as Swift?) - 
doesn't seem either of those would need anything special. What else is there?
Apps that needed more than POSIX semantics (like databases for transactions) 
already developed mechanisms to do that - how likely is my database server to 
replace those mechanisms with RADOS API and objects in the future? It's all 
posix-filesystem-centric and that's not going away.

Ceph feels like a perfect example of this 
https://en.wikipedia.org/wiki/Inner-platform_effect

I was really hoping there was an easy way to just get rid of journal and 
operate on filestore directly - that should suffice for anyone using RBD only  
(in fact until very recently I thought it was possible to just disable journal 
in config...)

Jan

* look at what other solutions do to get better performance - RDMA for example. 
You can't really get true RDMA performance if you're not touching the drive DMA 
buffer (or something else very close to data) over network directly with 
minimal latency. That doesn't (IMHO) preclude software-defined-storage like 
Ceph from working over RDMA, but you probably shouldn't try to outsmart the IO 
patterns...

> On 19 Oct 2015, at 19:44, James (Fei) Liu-SSI  
> wrote:
> 
> Hi John,
>Thanks for your explanations.
> 
>Actually, clients can.  Clients can request fairly complex operations like 
> "read an xattr, stop if it's not there, now write the following discontinuous 
> regions of the file...".  RADOS executes these transactions atomically.
>[James]  Could you mind detailing  a little bit more about operations in 
> Rados transactions?  Is there any limits number of ops in one rados 
> transaction? What if we come out similar transaction capabilities either in 
> new file system or keyvalue store to map what rados transaction has?  If we 
> can come out solution like what Jan proposed: 1:1 mapping for transactions 
> between filesystem/keyvaluestore, we don't necessary to have journaling in 
> objectstore which is going to dramatically improve the performance of Ceph.
> 
> Thanks.
> 
> Regards,
> James
> 
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of John 
> Spray
> Sent: Monday, October 19, 2015 3:44 AM
> To: Jan Schermer
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?
> 
> On Mon, Oct 19, 2015 at 8:55 AM, Jan Schermer  wrote:
>> I understand this. But the clients can't request something that 
>> doesn't fit a (POSIX) filesystem capabilities
> 
> Actually, clients can.  Clients can request fairly complex operations like 
> "read an xattr, stop if it's not there, now write the following discontinuous 
> regions of the file...".  RADOS executes these transactions atomically.
> 
> However, you are correct that for many cases (new files, sequential
> writes) it is possible to avoid the double write of data: the in-development 
> newstore backend does that.  But we still have cases where we do fancier 
> things than the backend (be it posix, or a KV
> store) can handle, so will have non-fast-path higher overhead ways of 
> handling it.
> 
> John
> 
> That means the requests can map 1:1 into the filestore (O_FSYNC from client 
> == O_FSYNC on the filestore object... ).
> Pagecache/io-schedulers are already smart enough to merge requests, preserve 
> ordering - they just do the right thing already. It's true that in a 
> distributed environment one async request can map to one OSD and then a 
> synchronous one comes and needs the first one to be flushed beforehand, so 
> that logic is presumably in place already - but I still don't see much need 
> for a journal in there (btw in case of RBD with caching, this logic is 
> probably not even needed at all and merging request in RBD cache makes more 
> sense than merging somewhere down the line).
>> It might be faster to merge small writes in journal when the journal is on 
>> SSDs and filestore on spinning rust, but it will surely be slower (cpu bound 
>> by ceph-osd?) when the filestore is fast enough or when the merging is not 
>> optimal.
>> I 

Re: [ceph-users] Cinder + CEPH Storage Full Scenario

2015-10-19 Thread Jan Schermer
Cinder checking free space will not help.
You will get one full OSD long before you run "out of space" from Ceph 
perspective, and it gets worse with the number of OSDs you have. Using 99% of 
space in Ceph is not the same as having all the OSDs 99% full because the data 
is not distributed in a completely fair fashion. Not sure how much that can be 
helped, but my cluster can store at most 2TB of data when claiming to have 14TB 
free.

You *really* need to monitor each OSD's free space and treat it with utmost 
criticality.
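
If you want to script that, here is a rough python-rados sketch; it assumes a
release new enough to have the "osd df" command (hammer and later) and that
its JSON output exposes a per-OSD "utilization" field:

    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ret, out, errs = cluster.mon_command(
            json.dumps({'prefix': 'osd df', 'format': 'json'}), b'')
        for osd in json.loads(out.decode('utf-8'))['nodes']:
            # Watch the fullest OSD, not the cluster-wide average.
            if osd['utilization'] > 85.0:
                print('%s is %.1f%% full' % (osd['name'], osd['utilization']))
    finally:
        cluster.shutdown()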

Jan


> On 19 Oct 2015, at 20:00, Andrew Woodward  wrote:
> 
> Cinder will periodically inspect the free space of the volume services and 
> use this data when determining which one to schedule to when a request is 
> received. In this case the cinder volume create request may error out in 
> scheduling. You may also see an error when instantiating a volume from an 
> image if it passes the prior but then becomes out of space during writing the 
> image to the volume.
> 
> I'm not sure if it's still the case, but in Havana (I see no reason for it to 
> change) the free space check in cinder didn't account for the difference 
> between promised space (the max of the volumes assigned) instead it would 
> literally look for free space in the output of `rados df`
> 
> As noted above if the cluster gets to "100%" used, bad things will happen to 
> your VM's. The most likely case is that they all assert read-only 
> filesystems. (100% is a missnomer as there is a configured max % where it 
> will stop accepting data writes to ensure that important object replication / 
> maintenance can occur and have the cluster not fall over)
> 
> On Mon, Oct 19, 2015 at 7:51 AM LOPEZ Jean-Charles  > wrote:
> Hi,
> 
> when an OSD gets full, any write operation to the entire cluster will be 
> disabled.
> 
> As a result, creating a single RBD will become impossible and all VMs that 
> need to write to one of their Ceph back RBDs will suffer the same pain.
> 
> Usually, this ends up as a bad sorry for the VMs.
> 
> The best practice is to monitor the disk space usage for the OSDs and as a 
> matter of fact RHCS 1.# includes a cep old df command to do this. You can 
> also use the output of the cep old report command to grab the appropriate 
> info to compute it or rely on external SNMP monitoring tools to grab the 
> usage details of the particular OSD disk drives.
> 
> Have a great day.
> JC
> 
> > On Oct 19, 2015, at 02:32, Bharath Krishna  > > wrote:
> >
> > I mean cluster OSDs are physically full.
> >
> > I understand its not a pretty way to operate CEPH allowing to become full,
> > but I just wanted to know the boundary condition if it becomes full.
> >
> > Will cinder create volume operation creates new volume at all or error is
> > thrown at Cinder API level itself stating that no space available?
> >
> > When IO stalls, will I be able to read the data from CEPH cluster I.e can
> > I still read data from existing volumes created from CEPH cluster?
> >
> > Thanks for the quick reply.
> >
> > Regards
> > M Bharath Krishna
> >
> > On 10/19/15, 2:51 PM, "Jan Schermer"  > > wrote:
> >
> >> Do you mean when the CEPH cluster (OSDs) is physically full or when the
> >> quota is reached?
> >>
> >> If CEPH becomes full it just stalls all IO (maybe just write IO, but
> >> effectively same thing) - not pretty and you must never ever let it
> >> become full.
> >>
> >> Jan
> >>
> >>
> >>> On 19 Oct 2015, at 11:15, Bharath Krishna  >>> >
> >>> wrote:
> >>>
> >>> Hi
> >>>
> >>> What happens when Cinder service with CEPH backend storage cluster
> >>> capacity is FULL?
> >>>
> >>> What would be the out come of new cinder create volume request?
> >>>
> >>> Will volume be created with space not available for use or an error
> >>> thrown from Cinder API stating no space available for new volume.
> >>>
> >>> I could not try this in my environment and fill up the cluster.
> >>>
> >>> Please reply if you have ever tried and tested this.
> >>>
> >>> Thank you.
> >>>
> >>> Regards,
> >>> M Bharath Krishna
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com 
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> >>> 
> >>
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> > 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
> -- 
> --

Re: [ceph-users] Cinder + CEPH Storage Full Scenario

2015-10-19 Thread John Spray
On Mon, Oct 19, 2015 at 7:28 PM, Jan Schermer  wrote:
> Cinder checking free space will not help.
> You will get one full OSD long before you run "out of space" from Ceph
> perspective, and it gets worse with the number of OSDs you have. Using 99%
> of space in Ceph is not the same as having all the OSDs 99% full because the
> data is not distributed in a completely fair fashion. Not sure how much that
> can be helped, but my cluster can store at most 2TB of data when claiming to
> have 14TB free.

The "max_avail" per-pool number that you get out of "ceph df" is aware
of this, and will calculate the actual writeable capacity based on
whatever OSD has the least available space.  From a quick look at the
code, it seems that the RBD Cinder plugin reports free_capacity_gb
from max_avail, so unless you're seeing a different behaviour I don't
think we have a problem.

This is me looking at master ceph and master cinder, so no idea which
released versions got this behaviour (the cinder code was modified in
March this year).
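
If you want to see what your own release reports, something like this will
print the per-pool max_avail that the Cinder driver ends up consuming (a
sketch; it assumes the "df" mon command returns JSON with per-pool stats
including max_avail):

    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ret, out, errs = cluster.mon_command(
            json.dumps({'prefix': 'df', 'format': 'json'}), b'')
        for pool in json.loads(out.decode('utf-8'))['pools']:
            # max_avail is computed from the fullest OSD reachable under the
            # pool's CRUSH rule, not from the raw sum of free space.
            print(pool['name'], pool['stats']['max_avail'])
    finally:
        cluster.shutdown()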

John

>
> You *really* need to monitor each OSD's free space and treat it with utmost
> criticality.
>
> Jan
>
>
> On 19 Oct 2015, at 20:00, Andrew Woodward  wrote:
>
> Cinder will periodically inspect the free space of the volume services and
> use this data when determining which one to schedule to when a request is
> received. In this case the cinder volume create request may error out in
> scheduling. You may also see an error when instantiating a volume from an
> image if it passes the prior but then becomes out of space during writing
> the image to the volume.
>
> I'm not sure if it's still the case, but in Havana (I see no reason for it
> to change) the free space check in cinder didn't account for the difference
> between promised space (the max of the volumes assigned) instead it would
> literally look for free space in the output of `rados df`
>
> As noted above if the cluster gets to "100%" used, bad things will happen to
> your VM's. The most likely case is that they all assert read-only
> filesystems. (100% is a missnomer as there is a configured max % where it
> will stop accepting data writes to ensure that important object replication
> / maintenance can occur and have the cluster not fall over)
>
> On Mon, Oct 19, 2015 at 7:51 AM LOPEZ Jean-Charles 
> wrote:
>>
>> Hi,
>>
>> when an OSD gets full, any write operation to the entire cluster will be
>> disabled.
>>
>> As a result, creating a single RBD will become impossible and all VMs that
>> need to write to one of their Ceph back RBDs will suffer the same pain.
>>
>> Usually, this ends up as a bad sorry for the VMs.
>>
>> The best practice is to monitor the disk space usage for the OSDs and as a
>> matter of fact RHCS 1.# includes a cep old df command to do this. You can
>> also use the output of the cep old report command to grab the appropriate
>> info to compute it or rely on external SNMP monitoring tools to grab the
>> usage details of the particular OSD disk drives.
>>
>> Have a great day.
>> JC
>>
>> > On Oct 19, 2015, at 02:32, Bharath Krishna 
>> > wrote:
>> >
>> > I mean cluster OSDs are physically full.
>> >
>> > I understand its not a pretty way to operate CEPH allowing to become
>> > full,
>> > but I just wanted to know the boundary condition if it becomes full.
>> >
>> > Will cinder create volume operation creates new volume at all or error
>> > is
>> > thrown at Cinder API level itself stating that no space available?
>> >
>> > When IO stalls, will I be able to read the data from CEPH cluster I.e
>> > can
>> > I still read data from existing volumes created from CEPH cluster?
>> >
>> > Thanks for the quick reply.
>> >
>> > Regards
>> > M Bharath Krishna
>> >
>> > On 10/19/15, 2:51 PM, "Jan Schermer"  wrote:
>> >
>> >> Do you mean when the CEPH cluster (OSDs) is physically full or when the
>> >> quota is reached?
>> >>
>> >> If CEPH becomes full it just stalls all IO (maybe just write IO, but
>> >> effectively same thing) - not pretty and you must never ever let it
>> >> become full.
>> >>
>> >> Jan
>> >>
>> >>
>> >>> On 19 Oct 2015, at 11:15, Bharath Krishna 
>> >>> wrote:
>> >>>
>> >>> Hi
>> >>>
>> >>> What happens when Cinder service with CEPH backend storage cluster
>> >>> capacity is FULL?
>> >>>
>> >>> What would be the out come of new cinder create volume request?
>> >>>
>> >>> Will volume be created with space not available for use or an error
>> >>> thrown from Cinder API stating no space available for new volume.
>> >>>
>> >>> I could not try this in my environment and fill up the cluster.
>> >>>
>> >>> Please reply if you have ever tried and tested this.
>> >>>
>> >>> Thank you.
>> >>>
>> >>> Regards,
>> >>> M Bharath Krishna
>> >>> ___
>> >>> ceph-users mailing list
>> >>> ceph-users@lists.ceph.com
>> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>
>> >
>> > ___
>> > ceph-users mailing list
>> >

Re: [ceph-users] Cinder + CEPH Storage Full Scenario

2015-10-19 Thread Jan Schermer
Sorry about that, I guess newer releases than my Dumpling calculate it 
differently, then.
I can take a look tomorrow at the exact numbers I get, but I'm pretty sure it's 
just a sum on D.

Jan

> On 19 Oct 2015, at 20:40, John Spray  wrote:
> 
> On Mon, Oct 19, 2015 at 7:28 PM, Jan Schermer  wrote:
>> Cinder checking free space will not help.
>> You will get one full OSD long before you run "out of space" from Ceph
>> perspective, and it gets worse with the number of OSDs you have. Using 99%
>> of space in Ceph is not the same as having all the OSDs 99% full because the
>> data is not distributed in a completely fair fashion. Not sure how much that
>> can be helped, but my cluster can store at most 2TB of data when claiming to
>> have 14TB free.
> 
> The "max_avail" per-pool number that you get out of "ceph df" is aware
> of this, and will calculate that actual writeable capacity based on
> whatever OSD has the least available space.  From a quick look at the
> code, it seems that the RBD Cinder plugin reports free_capacity_gb
> from max_avail, so unless you're seeing a different behaviour I don't
> think we have a problem.
> 
> This is me looking at master ceph and master cinder, so no idea which
> released versions got this behaviour (the cinder code was modified in
> March this year).
> 
> John
> 
>> 
>> You *really* need to monitor each OSD's free space and treat it with utmost
>> criticality.
>> 
>> Jan
>> 
>> 
>> On 19 Oct 2015, at 20:00, Andrew Woodward  wrote:
>> 
>> Cinder will periodically inspect the free space of the volume services and
>> use this data when determining which one to schedule to when a request is
>> received. In this case the cinder volume create request may error out in
>> scheduling. You may also see an error when instantiating a volume from an
>> image if it passes the prior but then becomes out of space during writing
>> the image to the volume.
>> 
>> I'm not sure if it's still the case, but in Havana (I see no reason for it
>> to change) the free space check in cinder didn't account for the difference
>> between promised space (the max of the volumes assigned) instead it would
>> literally look for free space in the output of `rados df`
>> 
>> As noted above if the cluster gets to "100%" used, bad things will happen to
>> your VM's. The most likely case is that they all assert read-only
>> filesystems. (100% is a missnomer as there is a configured max % where it
>> will stop accepting data writes to ensure that important object replication
>> / maintenance can occur and have the cluster not fall over)
>> 
>> On Mon, Oct 19, 2015 at 7:51 AM LOPEZ Jean-Charles 
>> wrote:
>>> 
>>> Hi,
>>> 
>>> when an OSD gets full, any write operation to the entire cluster will be
>>> disabled.
>>> 
>>> As a result, creating a single RBD will become impossible and all VMs that
>>> need to write to one of their Ceph back RBDs will suffer the same pain.
>>> 
>>> Usually, this ends up as a bad sorry for the VMs.
>>> 
>>> The best practice is to monitor the disk space usage for the OSDs and as a
>>> matter of fact RHCS 1.# includes a cep old df command to do this. You can
>>> also use the output of the cep old report command to grab the appropriate
>>> info to compute it or rely on external SNMP monitoring tools to grab the
>>> usage details of the particular OSD disk drives.
>>> 
>>> Have a great day.
>>> JC
>>> 
 On Oct 19, 2015, at 02:32, Bharath Krishna 
 wrote:
 
 I mean cluster OSDs are physically full.
 
 I understand its not a pretty way to operate CEPH allowing to become
 full,
 but I just wanted to know the boundary condition if it becomes full.
 
 Will cinder create volume operation creates new volume at all or error
 is
 thrown at Cinder API level itself stating that no space available?
 
 When IO stalls, will I be able to read the data from CEPH cluster I.e
 can
 I still read data from existing volumes created from CEPH cluster?
 
 Thanks for the quick reply.
 
 Regards
 M Bharath Krishna
 
 On 10/19/15, 2:51 PM, "Jan Schermer"  wrote:
 
> Do you mean when the CEPH cluster (OSDs) is physically full or when the
> quota is reached?
> 
> If CEPH becomes full it just stalls all IO (maybe just write IO, but
> effectively same thing) - not pretty and you must never ever let it
> become full.
> 
> Jan
> 
> 
>> On 19 Oct 2015, at 11:15, Bharath Krishna 
>> wrote:
>> 
>> Hi
>> 
>> What happens when Cinder service with CEPH backend storage cluster
>> capacity is FULL?
>> 
>> What would be the out come of new cinder create volume request?
>> 
>> Will volume be created with space not available for use or an error
>> thrown from Cinder API stating no space available for new volume.
>> 
>> I could not try this in my environment and fill up the cluster.
>> 
>> Please reply if you have ever tried and 

Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

2015-10-19 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

I think if there was a new disk format, we could get away without the
journal. It seems that Ceph is trying to do extra things because
regular file systems don't do exactly what is needed. I can understand
why the developers aren't excited about building and maintaining a new
disk format, but I think it could be pretty light and highly optimized
for object storage. I even started thinking through what one might
look like, but I've never written a file system so I'm probably just
living in a fantasy land. I still might try...
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Oct 19, 2015 at 12:18 PM, Jan Schermer  wrote:
> I'm sorry for appearing a bit dull (on purpose), I was hoping I'd hear what 
> other people using Ceph think.
>
> If I were to use RADOS directly in my app I'd probably rejoice at its 
> capabilities and how useful and non-legacy it is, but my use is basically for 
> RBD volumes with OpenStack (libvirt, qemu...). And for that those 
> capabilities are unneeded.
> I live in this RBD bubble so that's all I know, but isn't this also the only 
> usage pattern that 90% (or more) people using Ceph care about? Isn't this 
> what drives Ceph adoption in *Stack? Yet, isn't the biggest PITA when it 
> comes to displacing traditional (DAS, SAN, NAS) solutions the overhead 
> (=complexity) of Ceph?*
>
> What are the apps that actually use the RADOS features? I know Swift has some 
> RADOS backend (which does the same thing Swift already did by itself, maybe 
> with stronger consistency?), RGW (which basically does the same as Swift?) - 
> doesn't seem either of those would need anything special. What else is there?
> Apps that needed more than POSIX semantics (like databases for transactions) 
> already developed mechanisms to do that - how likely is my database server to 
> replace those mechanisms with RADOS API and objects in the future? It's all 
> posix-filesystem-centric and that's not going away.
>
> Ceph feels like a perfect example of this 
> https://en.wikipedia.org/wiki/Inner-platform_effect
>
> I was really hoping there was an easy way to just get rid of journal and 
> operate on filestore directly - that should suffice for anyone using RBD only 
>  (in fact until very recently I thought it was possible to just disable 
> journal in config...)
>
> Jan
>
> * look at what other solutions do to get better performance - RDMA for 
> example. You can't really get true RDMA performance if you're not touching 
> the drive DMA buffer (or something else very close to data) over network 
> directly with minimal latency. That doesn't (IMHO) preclude 
> software-defined-storage like Ceph from working over RDMA, but you probably 
> should't try to outsmart the IO patterns...
>
>> On 19 Oct 2015, at 19:44, James (Fei) Liu-SSI  wrote:
>>
>> Hi John,
>>Thanks for your explanations.
>>
>>Actually, clients can.  Clients can request fairly complex operations 
>> like "read an xattr, stop if it's not there, now write the following 
>> discontinuous regions of the file...".  RADOS executes these transactions 
>> atomically.
>>[James]  Could you mind detailing  a little bit more about operations in 
>> Rados transactions?  Is there any limits number of ops in one rados 
>> transaction? What if we come out similar transaction capabilities either in 
>> new file system or keyvalue store to map what rados transaction has?  If we 
>> can come out solution like what Jan proposed: 1:1 mapping for transactions 
>> between filesystem/keyvaluestore, we don't necessary to have journaling in 
>> objectstore which is going to dramatically improve the performance of Ceph.
>>
>> Thanks.
>>
>> Regards,
>> James
>>
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
>> John Spray
>> Sent: Monday, October 19, 2015 3:44 AM
>> To: Jan Schermer
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?
>>
>> On Mon, Oct 19, 2015 at 8:55 AM, Jan Schermer  wrote:
>>> I understand this. But the clients can't request something that
>>> doesn't fit a (POSIX) filesystem capabilities
>>
>> Actually, clients can.  Clients can request fairly complex operations like 
>> "read an xattr, stop if it's not there, now write the following 
>> discontinuous regions of the file...".  RADOS executes these transactions 
>> atomically.
>>
>> However, you are correct that for many cases (new files, sequential
>> writes) it is possible to avoid the double write of data: the in-development 
>> newstore backend does that.  But we still have cases where we do fancier 
>> things than the backend (be it posix, or a KV
>> store) can handle, so will have non-fast-path higher overhead ways of 
>> handling it.
>>
>> John
>>
>> That means the requests can map 1:1 into the filestore (O_FSYNC from client 
>> == O_FSYNC on the filestore ob

Re: [ceph-users] error while upgrading to infernalis last release on OSD serv

2015-10-19 Thread Gregory Farnum
As the infernalis release notes state, if you're upgrading you first
need to step through the current development hammer branch or the
(not-quite-released) 0.94.4.
-Greg

On Thu, Oct 15, 2015 at 7:27 AM, German Anders  wrote:
> Hi all,
>
> I'm trying to upgrade a ceph cluster (prev hammer release) to the last
> release of infernalis. So far so good while upgrading the mon servers, all
> work fine. But then when trying to upgrade the OSD servers I got an error
> while trying to start the osd services again:
>
> What I did is first to upgrade the packages, then stop the osd daemons, run
> the chown -R ceph:ceph /var/lib/ceph command, and then try to start again
> all the daemons. Well, they are not coming back and the error on one of the
> OSD is the following:
>
> (...)
> 5 10:21:05.910850
> os/FileStore.cc: 1698: FAILED assert(r == 0)
>
>  ceph version 9.1.0-61-gf2b9f89 (f2b9f898074db6473d993436e6aa566a945e3b40)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x8b) [0x7fec7b74489b]
>  2: (FileStore::init_temp_collections()+0xb2d) [0x7fec7b40ea9d]
>  3: (FileStore::mount()+0x33bb) [0x7fec7b41206b]
>  4: (OSD::init()+0x269) [0x7fec7b1bc2f9]
>  5: (main()+0x2817) [0x7fec7b142bb7]
>  6: (__libc_start_main()+0xf5) [0x7fec77d68ec5]
>  7: (()+0x30a9e7) [0x7fec7b1729e7]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to
> interpret this.
>
> --- logging levels ---
>0/ 5 none
>0/ 1 lockdep
>0/ 1 context
>1/ 1 crush
>1/ 5 mds
>1/ 5 mds_balancer
>1/ 5 mds_locker
>1/ 5 mds_log
>1/ 5 mds_log_expire
>1/ 5 mds_migrator
>0/ 1 buffer
>0/ 1 timer
>0/ 1 filer
>0/ 1 striper
>0/ 1 objecter
>0/ 5 rados
>0/ 5 rbd
>0/ 5 rbd_replay
>0/ 5 journaler
>0/ 5 objectcacher
>0/ 5 client
>   20/20 osd
>0/ 5 optracker
>0/ 5 objclass
>   20/20 filestore
>1/ 3 keyvaluestore
>   20/20 journal
>1/ 1 ms
>1/ 5 mon
>0/10 monc
>1/ 5 paxos
>0/ 5 tp
>1/ 5 auth
>1/ 5 crypto
>1/ 1 finisher
>1/ 5 heartbeatmap
>1/ 5 perfcounter
>1/ 5 rgw
>1/10 civetweb
>1/ 5 javaclient
>1/ 5 asok
>1/ 1 throttle
>0/ 0 refs
>1/ 5 xio
>1/ 5 compressor
>1/ 5 newstore
>   -2/-2 (syslog threshold)
>   -1/-1 (stderr threshold)
>   max_recent 1
>   max_new 1000
>   log_file /var/log/ceph/ceph-osd.0.log
> --- end dump of recent events ---
> 2015-10-15 10:21:05.923314 7fec7bc4f980 -1 *** Caught signal (Aborted) **
>  in thread 7fec7bc4f980
>
>  ceph version 9.1.0-61-gf2b9f89 (f2b9f898074db6473d993436e6aa566a945e3b40)
>  1: (()+0x7f031a) [0x7fec7b65831a]
>  2: (()+0x10340) [0x7fec79b02340]
>  3: (gsignal()+0x39) [0x7fec77d7dcc9]
>  4: (abort()+0x148) [0x7fec77d810d8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fec78688535]
>  6: (()+0x5e6d6) [0x7fec786866d6]
>  7: (()+0x5e703) [0x7fec78686703]
>  8: (()+0x5e922) [0x7fec78686922]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x278) [0x7fec7b744a88]
>  10: (FileStore::init_temp_collections()+0xb2d) [0x7fec7b40ea9d]
>  11: (FileStore::mount()+0x33bb) [0x7fec7b41206b]
>  12: (OSD::init()+0x269) [0x7fec7b1bc2f9]
>  13: (main()+0x2817) [0x7fec7b142bb7]
>  14: (__libc_start_main()+0xf5) [0x7fec77d68ec5]
>  15: (()+0x30a9e7) [0x7fec7b1729e7]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to
> interpret this.
>
> --- begin dump of recent events ---
>  0> 2015-10-15 10:21:05.923314 7fec7bc4f980 -1 *** Caught signal
> (Aborted) **
>  in thread 7fec7bc4f980
>
>  ceph version 9.1.0-61-gf2b9f89 (f2b9f898074db6473d993436e6aa566a945e3b40)
>  1: (()+0x7f031a) [0x7fec7b65831a]
>  2: (()+0x10340) [0x7fec79b02340]
>  3: (gsignal()+0x39) [0x7fec77d7dcc9]
>  4: (abort()+0x148) [0x7fec77d810d8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fec78688535]
>  6: (()+0x5e6d6) [0x7fec786866d6]
>  7: (()+0x5e703) [0x7fec78686703]
>  8: (()+0x5e922) [0x7fec78686922]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x278) [0x7fec7b744a88]
>  10: (FileStore::init_temp_collections()+0xb2d) [0x7fec7b40ea9d]
>  11: (FileStore::mount()+0x33bb) [0x7fec7b41206b]
>  12: (OSD::init()+0x269) [0x7fec7b1bc2f9]
>  13: (main()+0x2817) [0x7fec7b142bb7]
>  14: (__libc_start_main()+0xf5) [0x7fec77d68ec5]
>  15: (()+0x30a9e7) [0x7fec7b1729e7]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to
> interpret this.
>
> --- logging levels ---
>0/ 5 none
>0/ 1 lockdep
>0/ 1 context
>1/ 1 crush
>1/ 5 mds
>1/ 5 mds_balancer
>1/ 5 mds_locker
>1/ 5 mds_log
>1/ 5 mds_log_expire
>1/ 5 mds_migrator
>0/ 1 buffer
>0/ 1 timer
>0/ 1 filer
>0/ 1 striper
>0/ 1 objecter
>0/ 5 rados
>0/ 5 rbd
>0/ 5 rbd_replay
>0/ 5 journaler
>0/ 5 objectcacher
>0/ 5 client
>   20/20 osd
>0/ 5 optracker
>0/ 5 objclass
>   20/20 filestore
>1/ 3 keyvaluestore
>   20/20 journa

[ceph-users] v0.94.4 Hammer released

2015-10-19 Thread Sage Weil
This Hammer point release fixes several important bugs in Hammer, as well as
fixing interoperability issues that are required before an upgrade to
Infernalis. That is, all users of earlier versions of Hammer or any
version of Firefly will first need to upgrade to hammer v0.94.4 or
later before upgrading to Infernalis (or future releases).

All v0.94.x Hammer users are strongly encouraged to upgrade.

Changes
---

* build/ops: ceph.spec.in: 50-rbd.rules conditional is wrong (#12166, Nathan 
Cutler)
* build/ops: ceph.spec.in: ceph-common needs python-argparse on older distros, 
but doesn't require it (#12034, Nathan Cutler)
* build/ops: ceph.spec.in: radosgw requires apache for SUSE only -- makes no 
sense (#12358, Nathan Cutler)
* build/ops: ceph.spec.in: rpm: cephfs_java not fully conditionalized (#11991, 
Nathan Cutler)
* build/ops: ceph.spec.in: rpm: not possible to turn off Java (#11992, Owen 
Synge)
* build/ops: ceph.spec.in: running fdupes unnecessarily (#12301, Nathan Cutler)
* build/ops: ceph.spec.in: snappy-devel for all supported distros (#12361, 
Nathan Cutler)
* build/ops: ceph.spec.in: SUSE/openSUSE builds need libbz2-devel (#11629, 
Nathan Cutler)
* build/ops: ceph.spec.in: useless %py_requires breaks SLE11-SP3 build (#12351, 
Nathan Cutler)
* build/ops: error in ext_mime_map_init() when /etc/mime.types is missing 
(#11864, Ken Dreyer)
* build/ops: upstart: limit respawn to 3 in 30 mins (instead of 5 in 30s) 
(#11798, Sage Weil)
* build/ops: With root as default user, unable to have multiple RGW instances 
running (#10927, Sage Weil)
* build/ops: With root as default user, unable to have multiple RGW instances 
running (#11140, Sage Weil)
* build/ops: With root as default user, unable to have multiple RGW instances 
running (#11686, Sage Weil)
* build/ops: With root as default user, unable to have multiple RGW instances 
running (#12407, Sage Weil)
* cli: ceph: cli throws exception on unrecognized errno (#11354, Kefu Chai)
* cli: ceph tell: broken error message / misleading hinting (#11101, Kefu Chai)
* common: arm: all programs that link to librados2 hang forever on startup 
(#12505, Boris Ranto)
* common: buffer: critical bufferlist::zero bug (#12252, Haomai Wang)
* common: ceph-object-corpus: add 0.94.2-207-g88e7ee7 hammer objects (#13070, 
Sage Weil)
* common: do not insert emtpy ptr when rebuild emtpy bufferlist (#12775, Xinze 
Chi)
* common: [  FAILED  ] TestLibRBD.BlockingAIO (#12479, Jason Dillaman)
* common: LibCephFS.GetPoolId failure (#12598, Yan, Zheng)
* common: Memory leak in Mutex.cc, pthread_mutexattr_init without 
pthread_mutexattr_destroy (#11762, Ketor Meng)
* common: object_map_update fails with -EINVAL return code (#12611, Jason 
Dillaman)
* common: Pipe: Drop connect_seq increase line (#13093, Haomai Wang)
* common: recursive lock of md_config_t (0) (#12614, Josh Durgin)
* crush: ceph osd crush reweight-subtree does not reweight parent node (#11855, 
Sage Weil)
* doc: update docs to point to download.ceph.com (#13162, Alfredo Deza)
* fs: ceph-fuse 0.94.2-1trusty segfaults / aborts (#12297, Greg Farnum)
* fs: segfault launching ceph-fuse with bad --name (#12417, John Spray)
* librados: Change radosgw pools default crush ruleset (#11640, Yuan Zhou)
* librbd: correct issues discovered via lockdep / helgrind (#12345, Jason 
Dillaman)
* librbd: Crash during TestInternal.MultipleResize (#12664, Jason Dillaman)
* librbd: deadlock during cooperative exclusive lock transition (#11537, Jason 
Dillaman)
* librbd: Possible crash while concurrently writing and shrinking an image 
(#11743, Jason Dillaman)
* mon: add a cache layer over MonitorDBStore (#12638, Kefu Chai)
* mon: fix crush testing for new pools (#13400, Sage Weil)
* mon: get pools health'info have error (#12402, renhwztetecs)
* mon: implicit erasure code crush ruleset is not validated (#11814, Loic 
Dachary)
* mon: PaxosService: call post_refresh() instead of post_paxos_update() 
(#11470, Joao Eduardo Luis)
* mon: pgmonitor: wrong "at/near target max" reporting (#12401, huangjun)
* mon: register_new_pgs() should check ruleno instead of its index (#12210, 
Xinze Chi)
* mon: Show osd as NONE in ceph osd mapoutput (#11820, 
Shylesh Kumar)
* mon: the output is wrong when runing ceph osd reweight (#12251, Joao Eduardo 
Luis)
* osd: allow peek_map_epoch to return an error (#13060, Sage Weil)
* osd: cache agent is idle although one object is left in the cache (#12673, 
Loic Dachary)
* osd: copy-from doesn't preserve truncate_{seq,size} (#12551, Samuel Just)
* osd: crash creating/deleting pools (#12429, John Spray)
* osd: fix repair when recorded digest is wrong (#12577, Sage Weil)
* osd: include/ceph_features: define HAMMER_0_94_4 feature (#13026, Sage Weil)
* osd: is_new_interval() fixes (#10399, Jason Dillaman)
* osd: is_new_interval() fixes (#11771, Jason Dillaman)
* osd: long standing slow requests: 
connection->session->waiting_for_map->connection ref cycle (#12338, Samuel Just)
* osd: Mutex Assert from PipeConnection::try_g

Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

2015-10-19 Thread Gregory Farnum
On Mon, Oct 19, 2015 at 11:18 AM, Jan Schermer  wrote:
> I'm sorry for appearing a bit dull (on purpose), I was hoping I'd hear what 
> other people using Ceph think.
>
> If I were to use RADOS directly in my app I'd probably rejoice at its 
> capabilities and how useful and non-legacy it is, but my use is basically for 
> RBD volumes with OpenStack (libvirt, qemu...). And for that those 
> capabilities are unneeded.
> I live in this RBD bubble so that's all I know, but isn't this also the only 
> usage pattern that 90% (or more) people using Ceph care about? Isn't this 
> what drives Ceph adoption in *Stack? Yet, isn't the biggest PITA when it 
> comes to displacing traditional (DAS, SAN, NAS) solutions the overhead 
> (=complexity) of Ceph?*
>
> What are the apps that actually use the RADOS features? I know Swift has some 
> RADOS backend (which does the same thing Swift already did by itself, maybe 
> with stronger consistency?), RGW (which basically does the same as Swift?) - 
> doesn't seem either of those would need anything special. What else is there?
> Apps that needed more than POSIX semantics (like databases for transactions) 
> already developed mechanisms to do that - how likely is my database server to 
> replace those mechanisms with RADOS API and objects in the future? It's all 
> posix-filesystem-centric and that's not going away.
>
> Ceph feels like a perfect example of this 
> https://en.wikipedia.org/wiki/Inner-platform_effect
>
> I was really hoping there was an easy way to just get rid of journal and 
> operate on filestore directly - that should suffice for anyone using RBD only 
>  (in fact until very recently I thought it was possible to just disable 
> journal in config...)

The biggest thing you're missing here is that Ceph needs to keep *its*
data and metadata consistent. The filesystem journal does *not* let us
do that, so we need a journal of our own.

Could something be constructed to do that more efficiently? Probably,
with enough effort...but it's hard, and we don't have it right now,
and it will still require a Ceph journal, because Ceph will always
have its own metadata that needs to be kept consistent with its data.
(Short example: rbd client sends two writes. OSDs crash and restart.
client dies before they finish. OSDs try to reconstruct consistent
view of the data. If OSDs don't have the right metadata about which
writes have been applied, they can't tell who's got the newest data or
if somebody's missing some piece of it, and without journaling you
could get the second write applied but not the first, etc)

So no, just because RBD is a much simpler case doesn't mean we can
drop our journaling. Sorry, but the world isn't fair.

On Mon, Oct 19, 2015 at 12:18 PM, Robert LeBlanc  wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> I think if there was a new disk format, we could get away without the
> journal. It seems that Ceph is trying to do extra things because
> regular file systems don't do exactly what is needed. I can understand
> why the developers aren't excited about building and maintaining a new
> disk format, but I think it could be pretty light and highly optimized
> for object storage. I even started thinking through what one might
> look like, but I've never written a file system so I'm probably just
> living in a fantasy land. I still might try...

Well, there used to be one called EBOFS that Sage (mostly) wrote. He
killed it because it had some limits, fixing them was hard, and it had
basically turned into a full filesystem. Now he's trying again with
NewStore, which will hopefully be super awesome and dramatically
reduce stuff like the double writes. But it's a lot harder than you
think; it's been his main dev topic off-and-on for almost a year, and
you can see the thread he just started about it, so ;)

Basically what you're both running into is that any consistent system
needs transactions, and providing them is a pain in the butt. Lots of
applications actually don't bother, but a storage system like Ceph
definitely does.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

2015-10-19 Thread Jan Schermer

> On 19 Oct 2015, at 23:15, Gregory Farnum  wrote:
> 
> On Mon, Oct 19, 2015 at 11:18 AM, Jan Schermer  wrote:
>> I'm sorry for appearing a bit dull (on purpose), I was hoping I'd hear what 
>> other people using Ceph think.
>> 
>> If I were to use RADOS directly in my app I'd probably rejoice at its 
>> capabilities and how useful and non-legacy it is, but my use is basically 
>> for RBD volumes with OpenStack (libvirt, qemu...). And for that those 
>> capabilities are unneeded.
>> I live in this RBD bubble so that's all I know, but isn't this also the only 
>> usage pattern that 90% (or more) people using Ceph care about? Isn't this 
>> what drives Ceph adoption in *Stack? Yet, isn't the biggest PITA when it 
>> comes to displacing traditional (DAS, SAN, NAS) solutions the overhead 
>> (=complexity) of Ceph?*
>> 
>> What are the apps that actually use the RADOS features? I know Swift has 
>> some RADOS backend (which does the same thing Swift already did by itself, 
>> maybe with stronger consistency?), RGW (which basically does the same as 
>> Swift?) - doesn't seem either of those would need anything special. What 
>> else is there?
>> Apps that needed more than POSIX semantics (like databases for transactions) 
>> already developed mechanisms to do that - how likely is my database server 
>> to replace those mechanisms with RADOS API and objects in the future? It's 
>> all posix-filesystem-centric and that's not going away.
>> 
>> Ceph feels like a perfect example of this 
>> https://en.wikipedia.org/wiki/Inner-platform_effect
>> 
>> I was really hoping there was an easy way to just get rid of journal and 
>> operate on filestore directly - that should suffice for anyone using RBD 
>> only  (in fact until very recently I thought it was possible to just disable 
>> journal in config...)
> 
> The biggest thing you're missing here is that Ceph needs to keep *its*
> data and metadata consistent. The filesystem journal does *not* let us
> do that, so we need a journal of our own.
> 

I get that, but I can't see any reason for the client IO to cause any change in 
this data.
Rebalancing? Maybe OK if it needs this state data. Changing CRUSH? OK, probably 
a good idea to have several copies that are checksummed and versioned and put 
somewhere super-safe.
But I see no need for client IO to pass through here, ever...


> Could something be constructed to do that more efficiently? Probably,
> with enough effort...but it's hard, and we don't have it right now,
> and it will still require a Ceph journal, because Ceph will always
> have its own metadata that needs to be kept consistent with its data.
> (Short example: rbd client sends two writes. OSDs crash and restart.
> client dies before they finish. OSDs try to reconstruct consistent
> view of the data. If OSDs don't have the right metadata about which
> writes have been applied, they can't tell who's got the newest data or
> if somebody's missing some piece of it, and without journaling you
> could get the second write applied but not the first, etc)
> 

If the writes were followed by a flush (barrier) then that blocks until the 
data (all data not flushed) is safe and durable on the disk. Whether that means 
in a journal or flushed to OSD filesystem makes no difference.
If the writes were not followed by a flush then anything can happen - there 
could be any state (like only the second write happening) and that's what the 
client _MUST_ be able to cope with, Ceph or not. It's the same as a physical 
drive - will it have the data or not after a crash? Who cares - the OS didn't 
get a confirmation so it's replayed (from filesystem journal in the guest, 
database transaction log, retried from application...).
Even if just the first write happened and then the whole cluster went down - no 
different then a power failure with local disk.
I can't see a scenario where something breaks - RBD is a block device, not a 
filesystem. The filesystem on top already has a journal and better 
understanding on what needs to be durable or not.
Until the guest VM asks for data to be durable, any state is acceptable.
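
From the client side this is just the usual write-then-flush pattern; a sketch
with the librbd Python bindings (the pool 'rbd' and image 'vol1' are made-up
names, and it assumes a python-rbd recent enough to have Image.flush()):

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')
    try:
        with rbd.Image(ioctx, 'vol1') as image:
            image.write(b'A' * 4096, 0)      # may still sit in the RBD cache
            image.write(b'B' * 4096, 8192)   # after a crash, any combination may survive
            image.flush()                    # the durability barrier: now both must be safe
    finally:
        ioctx.close()
        cluster.shutdown()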

You are right that without a "Ceph transaction log" it has no idea what was 
written and what wasn't - does that matter? It does not :-)
If a guest makes a write to a RBD image in a 3-replica cluster and power on all 
3 OSDs involved goes down at the same moment, what can it expect?
Did the guest get a confirmation for the write or not?
If it did then all replicas are consistent at that one moment.
If it did not then there might be different objects on those 3 OSDs - so what? 
The guest doesn't care what data is there because no disk gives that guarantee. 
All Ceph needs to do is stick to one version  (by a simple timestamp possibly) 
and replicate it. Even if it was not the "best" copy, the guest filesystem must 
and will cope with that.
You're trying to bring consistency to something that doesn't really need it by 
design. Even if you dutifully preserve every si

[ceph-users] CephFS namespace

2015-10-19 Thread Erming Pei

Hi,

   Is there a way to list the namespaces in cephfs? How to set it up?

   From man page of ceph.mount, I see this:

To mount only part of the namespace:

  mount.ceph monhost1:/some/small/thing /mnt/thing

  But how do I know what the namespaces are in the first place?

Thanks,

Erming



--
-
 Erming Pei, Ph.D
 Senior System Analyst; Grid/Cloud Specialist

 Research Computing Group
 Information Services & Technology
 University of Alberta, Canada

 Tel: +1 7804929914    Fax: +1 7804921729
-

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS namespace

2015-10-19 Thread Gregory Farnum
On Mon, Oct 19, 2015 at 3:06 PM, Erming Pei  wrote:
> Hi,
>
>Is there a way to list the namespaces in cephfs? How to set it up?
>
>From man page of ceph.mount, I see this:
>
> To mount only part of the namespace:
>
>   mount.ceph monhost1:/some/small/thing /mnt/thing
>
>   But how to know the namespaces at first?

"Namespace" here means "directory tree" or "folder hierarchy".
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

2015-10-19 Thread louis

Hi, I am curious whether we need the Ceph journal instead of using the filesystem's existing journal, at least for the block use case. Can you help explain more how Ceph guarantees that data, files, states, and leveldb updates are applied atomically by using the Ceph journal? (Sent from NetEase Mail Master)
On 15 Oct 2015, at 02:29, Somnath Roy wrote: FileSystem like XFS guarantees a single file write but in Ceph transaction we are touching file/xattrs/leveldb (omap), so no way filesystem can guarantee that transaction. That's why FileStore has implemented a write_ahead journal. Basically, it is writing the entire transaction object there and only trimming from journal when it is actually applied (all the operation executed) and persisted in the backend.

Thanks & Regards
Somnath

-Original Message-
From: Jan Schermer [mailto:j...@schermer.cz]  
Sent: Wednesday, October 14, 2015 9:06 AM
To: Somnath Roy
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

But that's exactly what filesystems and their own journals do already :-)

Jan

> On 14 Oct 2015, at 17:02, Somnath Roy  wrote:
>  
> Jan,
> Journal helps FileStore to maintain the transactional integrity in the event of a crash. That's the main reason.
>  
> Thanks & Regards
> Somnath
>  
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer
> Sent: Wednesday, October 14, 2015 2:28 AM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?
>  
> Hi,
> I've been thinking about this for a while now - does Ceph really need a journal? Filesystems are already pretty good at committing data to disk when asked (and much faster too), we have external journals in XFS and Ext4...
> In a scenario where client does an ordinary write, there's no need to flush it anywhere (the app didn't ask for it) so it ends up in pagecache and gets committed eventually.
> If a client asks for the data to be flushed then fdatasync/fsync on the filestore object takes care of that, including ordering and stuff.
> For reads, you just read from filestore (no need to differentiate between filestore/journal) - pagecache gives you the right version already.
>  
> Or is journal there to achieve some tiering for writes when the running spindles with SSDs? This is IMO the only thing ordinary filesystems don't do out of box even when filesystem journal is put on SSD - the data get flushed to spindle whenever fsync-ed (even with data="" But in reality, most of the data will hit the spindle either way and when you run with SSDs it will always be much slower. And even for tiering - there are already many options (bcache, flashcache or even ZFS L2ARC) that are much more performant and proven stable. I think the fact that people  have a need to combine Ceph with stuff like that already proves the point.
>  
> So a very interesting scenario would be to disable Ceph journal and at most use data="" on ext4. The complexity of the data path would drop significantly, latencies decrease, CPU time is saved...
> I just feel that Ceph has lots of unnecessary complexity inside that duplicates what filesystems (and pagecache...) have been doing for a while now without eating most of our CPU cores - why don't we use that? Is it possible to disable journal completely?
>  
> Did I miss something that makes journal essential?
>  
> Jan
>  
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>  

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





Re: [ceph-users] CephFS namespace

2015-10-19 Thread Erming Pei

I see. That's also what I needed.
Thanks.

Can we allow only a part of the 'namespace' or directory tree to be 
mounted, from the *server* end? Just like NFS exporting?

And can permissions be set as well?

Erming




On 10/19/15, 4:07 PM, Gregory Farnum wrote:

On Mon, Oct 19, 2015 at 3:06 PM, Erming Pei  wrote:

Hi,

Is there a way to list the namespaces in cephfs? How to set it up?

From man page of ceph.mount, I see this:

To mount only part of the namespace:

   mount.ceph monhost1:/some/small/thing /mnt/thing

   But how to know the namespaces at first?

"Namespace" here means "directory tree" or "folder hierarchy".
-Greg



--
-
 Erming Pei, Ph.D
 Senior System Analyst; Grid/Cloud Specialist

 Research Computing Group
 Information Services & Technology
 University of Alberta, Canada

 Tel: +1 7804929914    Fax: +1 7804921729
-

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS namespace

2015-10-19 Thread Gregory Farnum
On Mon, Oct 19, 2015 at 3:26 PM, Erming Pei  wrote:
> I see. That's also what I needed.
> Thanks.
>
> Can we only allow a part of the 'namespace' or directory tree to be mounted
> from server end? Just like NFS exporting?
> And even setting of permissions as well?

This just got merged into the master branch, but it's not available in
any released versions. It's part of the cephx capabilities system.
-Greg
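
When that support does ship, the restriction would presumably be expressed as a
cephx capability along these lines; a hedged sketch only, since the feature is
unreleased at the time of writing (client name, path, and pool are placeholders):

    ceph auth get-or-create client.alice \
        mon 'allow r' \
        mds 'allow rw path=/volumes/alice' \
        osd 'allow rw pool=cephfs_data'
    # the resulting keyring is then used by the client that mounts /volumes/alice
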
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How ceph client abort IO

2015-10-19 Thread min fang
Can the librbd interface provide an abort API for aborting IO? If yes, can the
abort interface detach the write buffer immediately? I hope to reuse the write
buffer quickly after issuing the abort request, without waiting for the IO to be
aborted on the OSD side.

thanks.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Minimum failure domain

2015-10-19 Thread John Wilkins
The classic case is when you are just trying Ceph out on a laptop (e.g.,
using file directories for OSDs, setting the replica size to 2, and
setting osd_crush_chooseleaf_type to 0).
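
A minimal ceph.conf fragment for that single-node test case might look like this
(test settings only, not for production; assumes everything else is left at
defaults):

    [global]
    osd pool default size = 2
    # 0 = osd, so CRUSH may place replicas on the same host
    osd crush chooseleaf type = 0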

The statement is a guideline. You could, in fact, create a CRUSH hierarchy
consisting of OSD/journal groups within a host too. However, capturing the
host as a failure domain is preferred if you need to power down the host to
change a drive (assuming it's not hot-swappable).

There are cases with high density systems where you have multiple nodes in
the same chassis. So you might opt for a higher minimum failure domain in a
case like that.
There are also cases in larger clusters where you might have, for example,
three racks of servers with three top-of-rack switches--one for each rack.
If you want to isolate out the top of rack switch as a failure domain, you
will want to add the nodes/chassis to a rack within your CRUSH hierarchy,
and then select the rack level as your minimum failure domain. In those
scenarios, Ceph primary OSDs will replicate your copies to OSDs on
secondary nodes across chassis or racks respectively.
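
As a rough sketch of the rack case (bucket, host, and pool names are placeholders;
crush_ruleset is the pool setting name in hammer-era releases):

    ceph osd crush add-bucket rack1 rack
    ceph osd crush move rack1 root=default
    ceph osd crush move node1 rack=rack1          # repeat for the other racks and hosts
    ceph osd crush rule create-simple replicate-by-rack default rack
    ceph osd crush rule dump replicate-by-rack    # note the rule_id
    ceph osd pool set rbd crush_ruleset <rule_id>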

On Thu, Oct 15, 2015 at 1:55 PM, J David  wrote:

> In the Ceph docs, at:
>
> http://docs.ceph.com/docs/master/rados/deployment/ceph-deploy-osd/
>
> It says (under "Prepare OSDs"):
>
> "Note: When running multiple Ceph OSD daemons on a single node, and
> sharing a partioned journal with each OSD daemon, you should consider
> the entire node the minimum failure domain for CRUSH purposes, because
> if the SSD drive fails, all of the Ceph OSD daemons that journal to it
> will fail too."
>
> This, of course, makes perfect sense.  But, it got me wondering...
> under what circumstances would one *not* consider a single node to be
> the minimum failure domain for CRUSH purposes?
>
> Thanks!
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
John Wilkins
Red Hat
jowil...@redhat.com
(415) 425-9599
http://redhat.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

2015-10-19 Thread Jason Dillaman
> If I were to use RADOS directly in my app I'd probably rejoice at its
> capabilities and how useful and non-legacy it is, but my use is basically
> for RBD volumes with OpenStack (libvirt, qemu...). And for that those
> capabilities are unneeded.

Just to clarify, RBD does utilize librados transactions in certain cases. One 
example is CoW/CoR support for clones (the first op is a class method call to 
rbd.copyup, which will write the provided parent image block if the clone 
object doesn't already exist; the second op then mutates the data via 
truncate, overwrite, etc.). 

-- 

Jason Dillaman 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

2015-10-19 Thread Josh Durgin

On 10/19/2015 02:45 PM, Jan Schermer wrote:



On 19 Oct 2015, at 23:15, Gregory Farnum  wrote:

On Mon, Oct 19, 2015 at 11:18 AM, Jan Schermer  wrote:

I'm sorry for appearing a bit dull (on purpose), I was hoping I'd hear what 
other people using Ceph think.

If I were to use RADOS directly in my app I'd probably rejoice at its 
capabilities and how useful and non-legacy it is, but my use is basically for 
RBD volumes with OpenStack (libvirt, qemu...). And for that those capabilities 
are unneeded.
I live in this RBD bubble so that's all I know, but isn't this also the only 
usage pattern that 90% (or more) people using Ceph care about? Isn't this what 
drives Ceph adoption in *Stack? Yet, isn't the biggest PITA when it comes to 
displacing traditional (DAS, SAN, NAS) solutions the overhead (=complexity) of 
Ceph?*

What are the apps that actually use the RADOS features? I know Swift has some 
RADOS backend (which does the same thing Swift already did by itself, maybe 
with stronger consistency?), RGW (which basically does the same as Swift?) - 
doesn't seem either of those would need anything special. What else is there?
Apps that needed more than POSIX semantics (like databases for transactions) 
already developed mechanisms to do that - how likely is my database server to 
replace those mechanisms with RADOS API and objects in the future? It's all 
posix-filesystem-centric and that's not going away.

Ceph feels like a perfect example of this 
https://en.wikipedia.org/wiki/Inner-platform_effect

I was really hoping there was an easy way to just get rid of journal and 
operate on filestore directly - that should suffice for anyone using RBD only  
(in fact until very recently I thought it was possible to just disable journal 
in config...)


The biggest thing you're missing here is that Ceph needs to keep *its*
data and metadata consistent. The filesystem journal does *not* let us
do that, so we need a journal of our own.



I get that, but I can't see any reason for the client IO to cause any change in 
this data.
Rebalancing? Maybe OK if it needs this state data. Changing CRUSH? OK, probably 
a good idea to have several copies that are checksummed and versioned and put 
somewhere super-safe.
But I see no need for client IO to pass through here, ever...



Could something be constructed to do that more efficiently? Probably,
with enough effort...but it's hard, and we don't have it right now,
and it will still require a Ceph journal, because Ceph will always
have its own metadata that needs to be kept consistent with its data.
(Short example: rbd client sends two writes. OSDs crash and restart.
client dies before they finish. OSDs try to reconstruct consistent
view of the data. If OSDs don't have the right metadata about which
writes have been applied, they can't tell who's got the newest data or
if somebody's missing some piece of it, and without journaling you
could get the second write applied but not the first, etc)



If the writes were followed by a flush (barrier) then that blocks until the 
data (all data not flushed) is safe and durable on the disk. Whether that means 
in a journal or flushed to OSD filesystem makes no difference.
If the writes were not followed by a flush then anything can happen - there 
could be any state (like only the second write happening) and that's what the 
client _MUST_ be able to cope with, Ceph or not. It's the same as a physical 
drive - will it have the data or not after a crash? Who cares - the OS didn't 
get a confirmation so it's replayed (from filesystem journal in the guest, 
database transaction log, retried from application...).
Even if just the first write happened and then the whole cluster went down - no 
different then a power failure with local disk.
I can't see a scenario where something breaks - RBD is a block device, not a 
filesystem. The filesystem on top already has a journal and better 
understanding on what needs to be durable or not.
Until the guest VM asks for data to be durable, any state is acceptable.

You are right that without a "Ceph transaction log" it has no idea what was 
written and what wasn't - does that matter? It does not :-)
If a guest makes a write to a RBD image in a 3-replica cluster and power on all 
3 OSDs involved goes down at the same moment, what can it expect?
Did the guest get a confirmation for the write or not?
If it did then all replicas are consistent at that one moment.
If it did not then there might be different objects on those 3 OSDs - so what? The guest 
doesn't care what data is there because no disk gives that guarantee. All Ceph needs to 
do is stick to one version  (by a simple timestamp possibly) and replicate it. Even if it 
was not the "best" copy, the guest filesystem must and will cope with that.
You're trying to bring consistency to something that doesn't really need it by 
design. Even if you dutifully preserve every single IO the guest did - if it 
didn't get that confirmation then it will not u

Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

2015-10-19 Thread Jan Schermer

> On 20 Oct 2015, at 01:43, Josh Durgin  wrote:
> 
> On 10/19/2015 02:45 PM, Jan Schermer wrote:
>> 
>>> On 19 Oct 2015, at 23:15, Gregory Farnum  wrote:
>>> 
>>> On Mon, Oct 19, 2015 at 11:18 AM, Jan Schermer  wrote:
 I'm sorry for appearing a bit dull (on purpose), I was hoping I'd hear 
 what other people using Ceph think.
 
 If I were to use RADOS directly in my app I'd probably rejoice at its 
 capabilities and how useful and non-legacy it is, but my use is basically 
 for RBD volumes with OpenStack (libvirt, qemu...). And for that those 
 capabilities are unneeded.
 I live in this RBD bubble so that's all I know, but isn't this also the 
 only usage pattern that 90% (or more) people using Ceph care about? Isn't 
 this what drives Ceph adoption in *Stack? Yet, isn't the biggest PITA when 
 it comes to displacing traditional (DAS, SAN, NAS) solutions the overhead 
 (=complexity) of Ceph?*
 
 What are the apps that actually use the RADOS features? I know Swift has 
 some RADOS backend (which does the same thing Swift already did by itself, 
 maybe with stronger consistency?), RGW (which basically does the same as 
 Swift?) - doesn't seem either of those would need anything special. What 
 else is there?
 Apps that needed more than POSIX semantics (like databases for 
 transactions) already developed mechanisms to do that - how likely is my 
 database server to replace those mechanisms with RADOS API and objects in 
 the future? It's all posix-filesystem-centric and that's not going away.
 
 Ceph feels like a perfect example of this 
 https://en.wikipedia.org/wiki/Inner-platform_effect
 
 I was really hoping there was an easy way to just get rid of journal and 
 operate on filestore directly - that should suffice for anyone using RBD 
 only  (in fact until very recently I thought it was possible to just 
 disable journal in config...)
>>> 
>>> The biggest thing you're missing here is that Ceph needs to keep *its*
>>> data and metadata consistent. The filesystem journal does *not* let us
>>> do that, so we need a journal of our own.
>>> 
>> 
>> I get that, but I can't see any reason for the client IO to cause any change 
>> in this data.
>> Rebalancing? Maybe OK if it needs this state data. Changing CRUSH? OK, 
>> probably a good idea to have several copies that are checksummed and 
>> versioned and put somewhere super-safe.
>> But I see no need for client IO to pass through here, ever...
>> 
>> 
>>> Could something be constructed to do that more efficiently? Probably,
>>> with enough effort...but it's hard, and we don't have it right now,
>>> and it will still require a Ceph journal, because Ceph will always
>>> have its own metadata that needs to be kept consistent with its data.
>>> (Short example: rbd client sends two writes. OSDs crash and restart.
>>> client dies before they finish. OSDs try to reconstruct consistent
>>> view of the data. If OSDs don't have the right metadata about which
>>> writes have been applied, they can't tell who's got the newest data or
>>> if somebody's missing some piece of it, and without journaling you
>>> could get the second write applied but not the first, etc)
>>> 
>> 
>> If the writes were followed by a flush (barrier) then that blocks until the 
>> data (all data not flushed) is safe and durable on the disk. Whether that 
>> means in a journal or flushed to OSD filesystem makes no difference.
>> If the writes were not followed by a flush then anything can happen - there 
>> could be any state (like only the second write happening) and that's what 
>> the client _MUST_ be able to cope with, Ceph or not. It's the same as a 
>> physical drive - will it have the data or not after a crash? Who cares - the 
>> OS didn't get a confirmation so it's replayed (from filesystem journal in 
>> the guest, database transaction log, retried from application...).
>> Even if just the first write happened and then the whole cluster went down - 
>> no different then a power failure with local disk.
>> I can't see a scenario where something breaks - RBD is a block device, not a 
>> filesystem. The filesystem on top already has a journal and better 
>> understanding on what needs to be durable or not.
>> Until the guest VM asks for data to be durable, any state is acceptable.
>> 
>> You are right that without a "Ceph transaction log" it has no idea what was 
>> written and what wasn't - does that matter? It does not :-)
>> If a guest makes a write to a RBD image in a 3-replica cluster and power on 
>> all 3 OSDs involved goes down at the same moment, what can it expect?
>> Did the guest get a confirmation for the write or not?
>> If it did then all replicas are consistent at that one moment.
>> If it did not then there might be different objects on those 3 OSDs - so 
>> what? The guest doesn't care what data is there because no disk gives that 
>> guarantee. All Ceph 

Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

2015-10-19 Thread Haomai Wang
The fact is that the journal can help a lot for RBD use cases,
especially for small IOs. I don't think it will be a bottleneck. If we
just want to reduce the double write, that doesn't solve any performance
problem.

For RGW and CephFS, we actually need the journal to keep operations atomic.

On Tue, Oct 20, 2015 at 8:54 AM, Jan Schermer  wrote:
>
>> On 20 Oct 2015, at 01:43, Josh Durgin  wrote:
>>
>> On 10/19/2015 02:45 PM, Jan Schermer wrote:
>>>
 On 19 Oct 2015, at 23:15, Gregory Farnum  wrote:

 On Mon, Oct 19, 2015 at 11:18 AM, Jan Schermer  wrote:
> I'm sorry for appearing a bit dull (on purpose), I was hoping I'd hear 
> what other people using Ceph think.
>
> If I were to use RADOS directly in my app I'd probably rejoice at its 
> capabilities and how useful and non-legacy it is, but my use is basically 
> for RBD volumes with OpenStack (libvirt, qemu...). And for that those 
> capabilities are unneeded.
> I live in this RBD bubble so that's all I know, but isn't this also the 
> only usage pattern that 90% (or more) people using Ceph care about? Isn't 
> this what drives Ceph adoption in *Stack? Yet, isn't the biggest PITA 
> when it comes to displacing traditional (DAS, SAN, NAS) solutions the 
> overhead (=complexity) of Ceph?*
>
> What are the apps that actually use the RADOS features? I know Swift has 
> some RADOS backend (which does the same thing Swift already did by 
> itself, maybe with stronger consistency?), RGW (which basically does the 
> same as Swift?) - doesn't seem either of those would need anything 
> special. What else is there?
> Apps that needed more than POSIX semantics (like databases for 
> transactions) already developed mechanisms to do that - how likely is my 
> database server to replace those mechanisms with RADOS API and objects in 
> the future? It's all posix-filesystem-centric and that's not going away.
>
> Ceph feels like a perfect example of this 
> https://en.wikipedia.org/wiki/Inner-platform_effect
>
> I was really hoping there was an easy way to just get rid of journal and 
> operate on filestore directly - that should suffice for anyone using RBD 
> only  (in fact until very recently I thought it was possible to just 
> disable journal in config...)

 The biggest thing you're missing here is that Ceph needs to keep *its*
 data and metadata consistent. The filesystem journal does *not* let us
 do that, so we need a journal of our own.

>>>
>>> I get that, but I can't see any reason for the client IO to cause any 
>>> change in this data.
>>> Rebalancing? Maybe OK if it needs this state data. Changing CRUSH? OK, 
>>> probably a good idea to have several copies that are checksummed and 
>>> versioned and put somewhere super-safe.
>>> But I see no need for client IO to pass through here, ever...
>>>
>>>
 Could something be constructed to do that more efficiently? Probably,
 with enough effort...but it's hard, and we don't have it right now,
 and it will still require a Ceph journal, because Ceph will always
 have its own metadata that needs to be kept consistent with its data.
 (Short example: rbd client sends two writes. OSDs crash and restart.
 client dies before they finish. OSDs try to reconstruct consistent
 view of the data. If OSDs don't have the right metadata about which
 writes have been applied, they can't tell who's got the newest data or
 if somebody's missing some piece of it, and without journaling you
 could get the second write applied but not the first, etc)

>>>
>>> If the writes were followed by a flush (barrier) then that blocks until the 
>>> data (all data not flushed) is safe and durable on the disk. Whether that 
>>> means in a journal or flushed to OSD filesystem makes no difference.
>>> If the writes were not followed by a flush then anything can happen - there 
>>> could be any state (like only the second write happening) and that's what 
>>> the client _MUST_ be able to cope with, Ceph or not. It's the same as a 
>>> physical drive - will it have the data or not after a crash? Who cares - 
>>> the OS didn't get a confirmation so it's replayed (from filesystem journal 
>>> in the guest, database transaction log, retried from application...).
>>> Even if just the first write happened and then the whole cluster went down 
>>> - no different then a power failure with local disk.
>>> I can't see a scenario where something breaks - RBD is a block device, not 
>>> a filesystem. The filesystem on top already has a journal and better 
>>> understanding on what needs to be durable or not.
>>> Until the guest VM asks for data to be durable, any state is acceptable.
>>>
>>> You are right that without a "Ceph transaction log" it has no idea what was 
>>> written and what wasn't - does that matter? It does not :-)
>>> If a guest makes a write to a RBD image in a 3-replica cluste

[ceph-users] pgs active & remapped

2015-10-19 Thread wikison
Hi, 
I've run into a strange problem I've never seen before:
esta@storageOne:~$ sudo ceph -s
[sudo] password for esta:
cluster 0b9b05db-98fe-49e6-b12b-1cce0645c015
 health HEALTH_WARN
512 pgs stuck unclean
recovery 1440/2160 objects degraded (66.667%)
recovery 2160/2160 objects misplaced (100.000%)
 monmap e1: 1 mons at {monitorOne=192.168.1.153:6789/0}
election epoch 1, quorum 0 monitorOne
 osdmap e175: 8 osds: 8 up, 8 in; 512 remapped pgs
  pgmap v4258: 512 pgs, 2 pools, 2598 MB data, 720 objects
15768 MB used, 4155 GB / 4171 GB avail
1440/2160 objects degraded (66.667%)
2160/2160 objects misplaced (100.000%)
 512 active+remapped


And it is just stuck unclean. Does anybody know why?
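
A state where every PG is active+remapped and every object is misplaced usually
points at CRUSH being unable to satisfy the rule (for example, a replica count
larger than the number of hosts, or all OSDs grouped under a single failure
domain). A few commands that typically narrow it down (pool name is a
placeholder):

    ceph osd tree                       # how hosts and OSDs are arranged in CRUSH
    ceph osd pool get rbd size          # compare the replica count with the host count
    ceph osd crush rule dump            # which failure-domain type the rule chooses
    ceph pg dump_stuck unclean | head   # sample a few stuck PGs and their up/acting sets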







--

Zhen Wang

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Does SSD Journal improve the performance?

2015-10-19 Thread Libin Wu
Hi,
My environment has a 32-core CPU and 256GB of memory. The SSD can reach
30k write IOPS when using direct IO.

Finally, I figured out the problem: after changing the SSD's IO scheduler to
noop, performance improved noticeably.

Please forgive me, I didn't realize the IO scheduler could impact performance so
much.

Thanks!
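
For reference, the scheduler can be inspected and switched per device at runtime;
the change does not persist across reboots without a udev rule or kernel
parameter (device name is a placeholder):

    cat /sys/block/sdb/queue/scheduler        # e.g. "noop [deadline] cfq"
    echo noop > /sys/block/sdb/queue/scheduler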

2015-10-15 9:37 GMT+08:00 Christian Balzer :

>
> Hello,
>
> Firstly, this is clearly a ceph-users question, don't cross post to
> ceph-devel.
>
> On Thu, 15 Oct 2015 09:29:03 +0800 hzwuli...@gmail.com wrote:
>
> > Hi,
> >
> > It should be sure SSD Journal will improve the performance of IOPS. But
> > unfortunately it's not in my test.
> >
> > I have two pools with the same number of osds:
> > pool1, ssdj_sas:
> > 9 osd servers, 8 OSDs(SAS) on every server
> > Journal on SSD, one SSD disk for 4 SAS disks.
> >
> Details. All of them.
> Specific HW (CPU, RAM, etc.) of these servers and the network, what type of
> SSDs, HDDs, controllers.
>
> > pool 2, sas:
> > 9 osd servers, 8 OSDs(SAS) on every server
> > Journal on SAS disk itself.
> >
> Is the HW identical to pool1 except for the journal placement?
>
> > I use rbd to create a volume in pool1 and pool2 separately and use fio
> > to test the rand write IOPS。here is the fio configuration:
> >
> > rw=randwrite
> > ioengine=libaio
> > direct=1
> > iodepth=128
> > bs=4k
> > numjobs=1
> >
> > The result i got is:
> > volume in pool1, about 5k
> > volume in pool2, about 12k
> >
> Now this job will stress the CPUs quite a bit (which you should be able to
> see with atop or the likes).
>
> However if the HW is identical in both pools your SSD may be one of those
> that perform abysmal with direct IO.
>
> There are plenty of threads in the ML archives about this topic.
>
> Christian
>
> > It's a big gap here, anyone can give me some suggestion here?
> >
> > ceph version: hammer(0.94.3)
> > kernel: 3.10
> >
> >
> >
> > hzwuli...@gmail.com
>
>
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Fusion Communications
> http://www.gol.com/
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com