Thanks, this is useful in general.  I have a semi-related question:

Given an OSD server with multiple SSDs or NVME devices, is there an advantage to putting wal/db on a different device of the same speed?  For example, data on sda1, matching wal/db on sdb1,  and then data on sdb2 and wal/db on sda2?

    -- jacob


On 05/11/2018 12:46 PM, David Turner wrote:
This thread is off in left field and needs to be brought back to how things work.

While multiple OSDs can share the same device for their DB/WAL partitions, each OSD needs its own partition.  osd.0 could use nvme0n1p1, osd.1 could use nvme0n1p2, and so on.  You cannot use the same partition for every OSD.  Ceph-volume will not create the db/wal partitions for you; you need to create the partitions for the OSDs manually.  There is no need to put a filesystem on top of the wal/db partition.  That is wasted overhead that will slow things down.
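If I remember right, ceph-volume also accepts LVM logical volumes (in vg/lv form) for --block.db, so another option is to carve the NVMe up with LVM instead of GPT partitions.  A rough sketch; the VG name "ceph-db" and the LV names "db-sdb"/"db-sdc" are just placeholders, and the device names assume one shared NVMe and two data disks:

# A minimal sketch, assuming a shared NVMe at /dev/nvme0n1 and two data disks;
# the VG/LV names below are arbitrary examples, not anything Ceph requires.
pvcreate /dev/nvme0n1
vgcreate ceph-db /dev/nvme0n1
lvcreate -L 40G -n db-sdb ceph-db
lvcreate -L 40G -n db-sdc ceph-db
ceph-volume lvm create --bluestore --data /dev/sdb --block.db ceph-db/db-sdb
ceph-volume lvm create --bluestore --data /dev/sdc --block.db ceph-db/db-sdc

A nice side effect is that LVs keep stable names across reboots, so the by-partuuid fixup further down should not be needed for them.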

Back to the original email.

> Or do I need to use osd-db=/dev/nvme0n1p2 for data=/dev/sdb,
> osd-db=/dev/nvme0n1p3 for data=/dev/sdc, and so on?
This is what you need to do, but as said above, you need to create the partitions for --block-db yourself.  You talked about having a 10GB partition for this, but the general recommendation for block.db partitions is 10GB per 1TB of OSD.  If your OSD is a 4TB disk, you should be looking at closer to a 40GB block.db partition.  If your block.db partition is too small, then once it fills up it will spill over onto the data volume and slow things down.
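As a quick back-of-the-envelope check using that 10GB-per-1TB rule (the variable names below are just placeholders, not Ceph settings):

# Rough block.db sizing helper: 10GB of DB per 1TB of OSD data (~1%)
osd_size_tb=4
db_size_gb=$(( osd_size_tb * 10 ))
echo "A ${osd_size_tb}TB OSD wants roughly a ${db_size_gb}GB block.db partition"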

> And just to make sure - if I specify "--osd-db", I don't need
> to set "--osd-wal" as well, since the WAL will end up on the
> DB partition automatically, correct?
This is correct.  The wal will automatically be placed on the db if not otherwise specified.
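One way to sanity-check this after deployment (a quick sketch, assuming the default /var/lib/ceph/osd layout): if no separate WAL was given, the OSD directory has a block.db symlink but no block.wal.

# List the block symlinks for every OSD on this host; a missing block.wal
# means the WAL lives with the DB (or on the main block device).
for osd in /var/lib/ceph/osd/*; do
  echo "== $osd"
  ls -l $osd/block $osd/block.db $osd/block.wal 2>/dev/null
done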


I don't use ceph-deploy, but the process for creating the OSDs should be something like this.  After the OSDs are created, it is a good idea to make sure they are not looking up the db partition by its /dev/nvme0n1p2 device name, as that name can change across reboots if you have multiple NVMe devices.

# Make sure the disks are clean and ready to use as an OSD
for hdd in /dev/sd{b..c}; do
  ceph-volume lvm zap $hdd --destroy
done

# Create the nvme db partitions (assuming 10G size for a 1TB OSD)
for partition in {2..3}; do
  sgdisk -n $partition:0:+10G -c $partition:'ceph db' /dev/nvme0n1
done

# Create the OSD
echo "/dev/sdb /dev/nvme0n1p2
/dev/sdc /dev/nvme0n1p3" | while read hdd db; do
  ceph-volume lvm create --bluestore --data $hdd --block.db $db
done

# Fix the OSDs to look for the block.db partition by UUID instead of its device name.
for db in /var/lib/ceph/osd/*/block.db; do
  # Resolve the block.db symlink and extract the nvmeXnYpZ device name, if any
  dev=$(readlink $db | grep -Eo 'nvme[[:digit:]]+n[[:digit:]]+p[[:digit:]]+' || echo false)
  if [[ "$dev" != false ]]; then
    # Find the partition UUID that points at that device and re-link by UUID
    uuid=$(ls -l /dev/disk/by-partuuid/ | awk '/'${dev}'$/ {print $9}')
    ln -sf /dev/disk/by-partuuid/$uuid $db
  fi
done
systemctl restart ceph-osd.target
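
Afterwards you can double-check which data device and db partition each OSD ended up with, for example:

# Summarize every OSD on this host, including its data device and block.db target
ceph-volume lvm list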

On Fri, May 11, 2018 at 10:59 AM João Paulo Sacchetto Ribeiro Bastos <joaopaulos...@gmail.com> wrote:

    Actually, if you go to
    https://ceph.com/community/new-luminous-bluestore/ you will see
    that DB/WAL work on an XFS partition, while the data itself goes on
    a raw block device.

    Also, I gave you the wrong option in my last mail: where I said
    --osd-db it should be --block-db.

    On Fri, May 11, 2018 at 11:51 AM Oliver Schulz
    <oliver.sch...@tu-dortmund.de> wrote:

        Hi,

        thanks for the advice! I'm a bit confused now, though. ;-)
        I thought DB and WAL were supposed to go on raw block
        devices, not file systems?


        Cheers,

        Oliver


        On 11.05.2018 16:01, João Paulo Sacchetto Ribeiro Bastos wrote:
        > Hello Oliver,
        >
        > As far as I know, you can use the same DB device for about 4 or 5
        > OSDs; you just need to be aware of the free space. I'm also
        > developing a bluestore cluster, and our DB and WAL will be on the
        > same SSD of about 480GB serving 4 OSD HDDs of 4TB each. About the
        > sizes, it's just a feeling, because I couldn't find any clear rule
        > yet about how to measure the requirements.
        >
        > * The only concern that took me some time to realize is that you
        > should create an XFS partition if using ceph-deploy, because if you
        > don't it will simply give you a RuntimeError without any hint about
        > what's going on.
        >
        > So, answering your question, you could do something like:
        > $ ceph-deploy osd create --bluestore --data=/dev/sdb --block-db /dev/nvme0n1p1 $HOSTNAME
        > $ ceph-deploy osd create --bluestore --data=/dev/sdc --block-db /dev/nvme0n1p1 $HOSTNAME
        >
        > On Fri, May 11, 2018 at 10:35 AM Oliver Schulz
        > <oliver.sch...@tu-dortmund.de> wrote:
        >
        >     Dear Ceph Experts,
        >
        >     I'm trying to set up some new OSD storage nodes, now with
        >     bluestore (our existing nodes still use filestore). I'm
        >     a bit unclear on how to specify WAL/DB devices: Can
        >     several OSDs share one WAL/DB partition? So, can I do
        >
        >           ceph-deploy osd create --bluestore --osd-db=/dev/nvme0n1p2 --data=/dev/sdb HOSTNAME
        >
        >           ceph-deploy osd create --bluestore --osd-db=/dev/nvme0n1p2 --data=/dev/sdc HOSTNAME
        >
        >           ...
        >
        >     Or do I need to use osd-db=/dev/nvme0n1p2 for data=/dev/sdb,
        >     osd-db=/dev/nvme0n1p3 for data=/dev/sdc, and so on?
        >
        >     And just to make sure - if I specify "--osd-db", I don't need
        >     to set "--osd-wal" as well, since the WAL will end up on the
        >     DB partition automatically, correct?
        >
        >
        >     Thanks for any hints,
        >
        >     Oliver
        >
        > --
        >
        > João Paulo Sacchetto Ribeiro Bastos
        > +55 31 99279-7092
        >

--
    João Paulo Bastos
    DevOps Engineer at Mav Tecnologia
    Belo Horizonte - Brazil
    +55 31 99279-7092
