Docker Swarm Cluster Improvements

This post is part of a series.

  1. Docker Swarm with Ceph for cross-server files
  2. Upgrading Ceph in Docker Swarm
  3. Docker Swarm Cluster Improvements (This Post)

Since my previous posts about running docker-swarm with ceph, I’ve been using this setup fairly extensively in production and have made some changes that follow on from those posts.

Upgrading Ceph in Docker Swarm

This post is a follow-up to an earlier blog post about setting up a docker-swarm cluster with ceph.

I’ve been running this cluster quite happily for a while now. However, since I set it up, a new version of ceph has been released (nautilus), so now it’s time for some upgrades.

Note: This post is out of date now.

I would suggest looking at this post and using the docker-compose based upgrade workflow instead, up to the housekeeping part.

I’ve mostly followed https://docs.ceph.com/docs/master/releases/nautilus/#upgrading-from-mimic-or-luminous but adapted it for the fact we’re running everything in docker. I recommend that you have a read through this yourself first to have an idea of what we are doing and why.

(It’s worth noting at this point that this guide was mostly written after the fact based on command history so I may have missed something. It’s always a good idea to do this on a test cluster first, or in a maintenance window!)

Before we begin the upgrade, we should pull the new image on each node in advance to save time later:

docker pull ceph/daemon:latest-nautilus

Now we can prepare to update. Firstly on any node we tell ceph not to worry about rebalancing:

ceph osd set noout

Now we can begin upgrading ceph. The process is quite simple and is the same for each daemon type: on each node we stop and remove the old container, then start a new one with the same flags we used in the past. So here we go:

On each node, one at a time, restart the ceph-mon container:

docker stop ceph-mon; docker rm ceph-mon
docker run -d --net=host --restart always -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ \
-e MON_IP=$(ip addr show dev eth0 | grep "inet " | head -n 1 | awk '{print $2}' | awk -F/ '{print $1}') \
-e CEPH_PUBLIC_NETWORK=$(ip route show dev eth0 | grep link | grep -v 169.254.0.0 | awk '{print $1}') \
--name="ceph-mon" ceph/daemon:latest-nautilus mon

(This is basically the same command that was used before, except we’re now specifying that we want to use ceph/daemon:latest-nautilus as the image source)
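
The two $(...) substitutions in that command are fairly dense, so here is a minimal sketch of what each pipeline does, split into functions that read the ip output on stdin. The function names first_ipv4 and public_network are my own for illustration and are not part of the original commands.

```shell
# Extract the first IPv4 address from `ip addr show dev <iface>` output on stdin.
# "inet " (with the trailing space) skips the "inet6" lines.
first_ipv4() {
    grep 'inet ' | head -n 1 | awk '{print $2}' | awk -F/ '{print $1}'
}

# Extract the directly connected network (in CIDR form) from
# `ip route show dev <iface>` output on stdin, skipping the
# 169.254.0.0/16 link-local route.
public_network() {
    grep link | grep -v 169.254.0.0 | awk '{print $1}'
}

# Example usage on a node (eth0 is the interface name used in this post):
#   ip addr show dev eth0 | first_ipv4
#   ip route show dev eth0 | public_network
```

Running both by hand before restarting the monitor is a cheap way to confirm that MON_IP and CEPH_PUBLIC_NETWORK will come out as expected, especially if the interface is not called eth0 on your hosts.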

After we have done this, we can check that the upgrade was successful:

ceph mon versions should show something like:

{
    "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 3
}
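
While the monitors are being restarted one by one, this output lists more than one version, each with its own count. As a rough sketch, the check can be automated by counting the version entries and looking for the release name. The function name all_on_release is hypothetical, and it relies on plain grep rather than a JSON parser, so it assumes the one-entry-per-line layout shown above.

```shell
# Succeeds only if the `ceph <daemon> versions` JSON on stdin lists exactly
# one version entry and that entry names the given release.
# $1 = release name, e.g. nautilus
all_on_release() {
    versions_json=$(cat)
    entry_count=$(printf '%s\n' "$versions_json" | grep -c '"ceph version')
    [ "$entry_count" -eq 1 ] &&
        printf '%s\n' "$versions_json" | grep -q "\"ceph version .* $1 "
}

# Example usage:
#   ceph mon versions | all_on_release nautilus && echo "all mons upgraded"
```

The same function works unchanged for the mgr, osd and mds version commands used later in this post.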

Now the same for the mgr containers:

docker stop ceph-mgr; docker rm ceph-mgr
docker run -d --net=host --privileged=true --pid=host -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ --name="ceph-mgr" --restart=always ceph/daemon:latest-nautilus mgr

We can check this with ceph mgr versions.

And the osd containers:

docker stop ceph-osd; docker rm ceph-osd
docker run -d --net=host --privileged=true --pid=host -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ -v /dev/:/dev/ -e OSD_DEVICE=/dev/sdb -e OSD_TYPE=disk --name="ceph-osd" --restart=always ceph/daemon:latest-nautilus osd

We can check this with ceph osd versions. (You might want to wait for the output of this command to show that the current node is running the new version before moving on to the next node.)
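
Rather than re-running the check by hand, the waiting can be wrapped in a small retry helper. This is only a sketch: wait_until is a name I have made up, it uses global shell variables for simplicity, and the check command you pass to it is up to you.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_until <attempts> <delay-seconds> <command> [args...]
wait_until() {
    attempts=$1
    delay=$2
    shift 2
    while [ "$attempts" -gt 0 ]; do
        if "$@"; then
            return 0
        fi
        attempts=$((attempts - 1))
        # Only sleep if another attempt is still to come.
        if [ "$attempts" -gt 0 ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example usage (the check itself is an assumption, adjust to taste):
#   wait_until 60 10 sh -c 'ceph osd versions | grep -q nautilus'
```

Note that the example check only proves that at least one OSD reports nautilus, so on later nodes you would want to compare the counts in the output instead.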

Now we can move on to the MDS containers.

Firstly we need to change max_mds to 1 if it’s not already (You can check using ceph fs get cephfs):

ceph fs set cephfs max_mds 1

Now we should stop all the non-active MDSs. We can see the currently active MDS using: ceph status | grep -i mds

And we stop the standby (non-active) MDSs by running this on each of their nodes:

docker stop ceph-mds; docker rm ceph-mds

And then once ceph status shows only the active MDS, we can restart the remaining one:

docker stop ceph-mds; docker rm ceph-mds
docker run -d --net=host --name ceph-mds --restart always -v /var/lib/ceph/:/var/lib/ceph/ -v /etc/ceph:/etc/ceph -e CEPHFS_CREATE=1 -e CEPHFS_DATA_POOL_PG=128 -e CEPHFS_METADATA_POOL_PG=128 ceph/daemon:latest-nautilus mds

And then restart all the standby MDSs:

docker run -d --net=host --name ceph-mds --restart always -v /var/lib/ceph/:/var/lib/ceph/ -v /etc/ceph:/etc/ceph -e CEPHFS_CREATE=1 -e CEPHFS_DATA_POOL_PG=128 -e CEPHFS_METADATA_POOL_PG=128 ceph/daemon:latest-nautilus mds

At this point, the max_mds value can be reset if it was previously anything other than 1.

And now we can check ceph mds versions shows our updated MDSs:

{
    "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 3
}

Now for some post-upgrade housekeeping, on any node:

ceph osd require-osd-release nautilus
ceph osd unset noout
ceph mon enable-msgr2

We should also now update our config files and local version of ceph.

Firstly, let’s import our current config files into the cluster configuration db. Run this on all nodes:

ceph config assimilate-conf -i /etc/ceph/ceph.conf

Then we can upgrade the local ceph tools:

rpm -e ceph-release; rpm -Uvh https://download.ceph.com/rpm-nautilus/el7/noarch/ceph-release-1-1.el7.noarch.rpm
yum clean all; yum update ceph

And update our local config to the minimal config:

cp -f /etc/ceph/ceph.conf /etc/ceph/ceph.conf.old
ceph config generate-minimal-conf > /etc/ceph/ceph.conf.new
mv -f /etc/ceph/ceph.conf.new /etc/ceph/ceph.conf

We should also update our fstab entry to include multiple servers, not just the current one, so that we can actually mount properly on startup even if the local monitor is not up yet (this should have been done in the original guide. I learned afterwards!):

export CEPHMON=`ceph mon dump 2>&1 | grep "] mon." | awk '{print $3}' | sed -r 's/mon.(.*)/\1:6789/g'`
sed -ri "s/.*(:\/\s+\/var\/data\/.*)/$(echo ${CEPHMON} | sed 's/ /,/g')\1/" /etc/fstab
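
Those two lines do a lot in a small space, so here is the same logic broken into two functions that work on stdin, which makes it possible to try them against a copy of the data before touching /etc/fstab. The names mon_list and rewrite_fstab are hypothetical, and like the original this assumes GNU sed (for -r and \s).

```shell
# Turn `ceph mon dump` output on stdin into a comma-separated list of
# "<mon-name>:6789" entries, suitable for the source field in fstab.
mon_list() {
    grep '] mon\.' | awk '{print $3}' | sed -r 's/^mon\.(.*)$/\1:6789/' | paste -sd, -
}

# Rewrite the source field of the /var/data/ CephFS entry in fstab content
# read from stdin. $1 = comma-separated monitor list. Other lines pass
# through unchanged.
rewrite_fstab() {
    sed -r "s/^.*(:\/\s+\/var\/data\/.*)$/$1\1/"
}

# Example usage, writing to a scratch file for review first:
#   rewrite_fstab "$(ceph mon dump 2>&1 | mon_list)" < /etc/fstab > /tmp/fstab.new
```

Diffing /tmp/fstab.new against /etc/fstab before copying it into place is a bit slower than sed -i, but a broken fstab on a node that is about to be rebooted is not much fun.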

This may also be a good time for other OS updates and a reboot if required. Run ceph osd set noout first to stop ceph rebalancing when the node goes down, then check ceph status to see whether the current node is the active MDS and, if it is, fail it with ceph mds fail $(hostname -s). Run ceph osd unset noout when we’re done.

Before rebooting we will want to drain the node of active containers:

docker node update --availability drain `hostname -f`

and then undrain it when we’re done:

docker node update --availability active `hostname -f`
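
Put together, the per-node reboot steps from this section look roughly like the sketch below. The function names and the DRY_RUN switch are my own additions so the sequence can be printed and reviewed without running anything, and whether the node is the active MDS is left for you to decide from ceph status.

```shell
# Print each command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Steps before rebooting a node.
# $1 = swarm node name (as given by hostname -f)
# $2 = MDS name to fail, only if this node is the active MDS (optional)
pre_reboot() {
    run ceph osd set noout
    if [ -n "${2:-}" ]; then
        run ceph mds fail "$2"
    fi
    run docker node update --availability drain "$1"
}

# Steps once the node is back up. $1 = swarm node name.
post_reboot() {
    run docker node update --availability active "$1"
    run ceph osd unset noout
}

# Example usage:
#   DRY_RUN=1; pre_reboot "$(hostname -f)" "$(hostname -s)"
```

Leaving noout set until the node is active again mirrors the order described above, so ceph does not start rebalancing while the node is down.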

And that’s it! Overall a pretty painless upgrade process, which is nice.

Docker Swarm with Ceph for cross-server files

I’ve been wanting to play with Docker Swarm for a while now for hosting containers, and finally sat down this weekend to do it.

Something that has always stopped me before now was that I wanted some kind of cross-site storage, but I don’t have any SAN storage available to me, just standalone hosts. I’ve been able to work around this using ceph on the nodes.

Note: I’ve never used ceph before and I don’t really know what I’m doing with it, so this is all a bit of guesswork. I used Funky Penguin’s Geek Cookbook as a basis for some of this, though some things have changed since then, and I’m using base CentOS, not AtomicHost (I tried AtomicHost, but wanted a newer version of docker so switched away).

All my physical servers run Proxmox, and this is no exception. On 3 of these host nodes I created a new VM (1 per node) to be part of the cluster. These all have 3 disks, 1 for the base OS, 1 for Ceph, 1 for cloud-init (The non-cloud-init disks are all SCSI with individual iothreads).

CentOS provide a cloud-image compatible disk here that I use as the base OS. I created a disk in Proxmox, then detached it, overwrote it with the CentOS-provided image and re-attached it. I could have used an Ubuntu cloud-image instead.

I now had 3 empty CentOS VMs ready to go.

The first thing to do is get the nodes ready for docker:

curl https://download.docker.com/linux/centos/docker-ce.repo -o /etc/yum.repos.d/docker-ce.repo
mkdir /etc/docker
echo '{"storage-driver": "overlay2"}' > /etc/docker/daemon.json
yum install docker-ce
systemctl start chronyd
systemctl enable chronyd
systemctl start docker
systemctl enable docker

And build our swarm cluster.

On the first node:

docker swarm init
docker swarm join-token manager

And then on the other 2 nodes, copy and paste the output from the last command to join to the cluster. This joins all 3 nodes as managers, and you can confirm the cluster is working like so:

[root@ds-2 ~]# docker node ls
ID                            HOSTNAME                         STATUS              AVAILABILITY        MANAGER STATUS      ENGINE VERSION
fo6paibeunoo9sulaiqu3iuqu     ds-1.dev.shanemcc.net            Ready               Active              Leader              18.09.1
phoy6ju7ait1aew7yifiemaob *   ds-2.dev.shanemcc.net            Ready               Active              Reachable           18.09.1
eexahtaiza1saibeishu8quie     ds-3.dev.shanemcc.net            Ready               Active              Reachable           18.09.1
[root@ds-2 ~]#
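
If you want to check this from a script rather than by eye, a small awk filter over the table is enough. This is a sketch only: ready_nodes is a made-up name, and it simply looks for a column that reads exactly "Ready" on each row after the header.

```shell
# Count nodes whose STATUS column reads "Ready" in `docker node ls`
# output on stdin. The column position shifts on the row marked with "*",
# so every field on the row is checked.
ready_nodes() {
    awk 'NR > 1 {
             for (i = 1; i <= NF; i++)
                 if ($i == "Ready") { n++; break }
         }
         END { print n + 0 }'
}

# Example usage:
#   [ "$(docker node ls | ready_nodes)" -eq 3 ] && echo "all 3 nodes are Ready"
```

For the table above this prints 3. docker node ls also accepts a --format option with placeholders such as {{.Hostname}} and {{.Status}}, which avoids parsing the table at all.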

All 3 host nodes also have SSH keys generated (ssh-keygen -t ed25519) and set up within /root/.ssh/authorized_keys on each node so that I can ssh between them.

Note: This section is out of date now. I would suggest deploying a newer version of ceph, and I now recommend deploying ceph using docker-compose as per this post.

I’ve not tested this, but you should be able to deploy the docker-compose file from that post and start the containers from that instead of using the docker run commands below (with the exception of the one to zap the OSD)

Now we can start setting up ceph.

Even though we will be running ceph within docker containers, I’ve also installed the ceph tools on the host node for convenience:

rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
rpm -Uvh https://download.ceph.com/rpm-luminous/el7/noarch/ceph-release-1-1.el7.noarch.rpm
yum install ceph

Remove any old ceph that may be lying around:

rm -Rfv /etc/ceph
rm -Rfv /var/lib/ceph
mkdir /etc/ceph
mkdir /var/lib/ceph
chcon -Rt svirt_sandbox_file_t /etc/ceph
chcon -Rt svirt_sandbox_file_t /var/lib/ceph

On the first node, initialise a ceph monitor:

docker run -d --net=host --restart always -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ \
-e MON_IP=$(ip addr show dev eth0 | grep "inet " | head -n 1 | awk '{print $2}' | awk -F/ '{print $1}') \
-e CEPH_PUBLIC_NETWORK=$(ip route show dev eth0 | grep link | grep -v 169.254.0.0 | awk '{print $1}') \
--name="ceph-mon" ceph/daemon mon

And then copy the generated data over to the other 2 nodes:

scp -r /etc/ceph/* ds-2:/etc/ceph/
scp -r /etc/ceph/* ds-3:/etc/ceph/

And start the monitor on those also using the same command again.

Now, on all 3 nodes we can start a manager:

docker run -d --net=host --privileged=true --pid=host -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ --name="ceph-mgr" --restart=always ceph/daemon mgr

And create the OSDs on all 3 nodes. Be careful: this will remove all the data from the disk provided (/dev/sdb), which is given twice here, once to zap the device and once to start the OSD:

ceph auth get client.bootstrap-osd -o /var/lib/ceph/bootstrap-osd/ceph.keyring
docker run --rm --privileged=true -v /dev/:/dev/ -e OSD_DEVICE=/dev/sdb ceph/daemon zap_device
docker run -d --net=host --privileged=true --pid=host -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ -v /dev/:/dev/ -e OSD_DEVICE=/dev/sdb -e OSD_TYPE=disk --name="ceph-osd" --restart=always ceph/daemon osd

Once the OSDs are finished initialising on each node (watch docker logs -f ceph-osd), we can create the MDSs on each node:

docker run -d --net=host --name ceph-mds --restart always -v /var/lib/ceph/:/var/lib/ceph/ -v /etc/ceph:/etc/ceph -e CEPHFS_CREATE=1 -e CEPHFS_DATA_POOL_PG=128 -e CEPHFS_METADATA_POOL_PG=128 ceph/daemon mds

And then once these are created, let’s tell ceph how many copies of things to keep:

ceph osd pool set cephfs_data size 3
ceph osd pool set cephfs_metadata size 3

And there’s no point scrubbing on VM disks:

ceph osd set noscrub
ceph osd set nodeep-scrub

Now, we have a 3-node ceph cluster set up and we can mount it into the hosts. Each host will mount from itself:

mkdir /var/data
ceph auth get-or-create client.dockerswarm osd 'allow rw' mon 'allow r' mds 'allow' > /etc/ceph/keyring.dockerswarm
echo "$(hostname -s):6789:/      /var/data/      ceph      name=dockerswarm,secret=$(ceph-authtool /etc/ceph/keyring.dockerswarm -p -n client.dockerswarm),noatime,_netdev,context=system_u:object_r:svirt_sandbox_file_t:s0 0 2" >> /etc/fstab
mount -a
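
The fstab line above is hard to read in one go, so here is a sketch that builds the same kind of line from its parts: the monitor source, the mount point, the filesystem type (ceph), the mount options, and the final dump and fsck-pass fields (0 and 2). The function name cephfs_fstab_line is hypothetical, and it separates fields with tabs rather than the runs of spaces used above, which fstab treats the same way.

```shell
# Print one fstab line for a CephFS mount.
# $1 = monitor list (e.g. "ds-1:6789" or a comma-separated list)
# $2 = mount point
# $3 = ceph client name (without the "client." prefix)
# $4 = client secret
cephfs_fstab_line() {
    fstab_opts="name=$3,secret=$4,noatime,_netdev"
    fstab_opts="$fstab_opts,context=system_u:object_r:svirt_sandbox_file_t:s0"
    printf '%s:/\t%s\tceph\t%s\t0 2\n' "$1" "$2" "$fstab_opts"
}

# Example usage, printing the line for review instead of appending it:
#   cephfs_fstab_line "$(hostname -s):6789" /var/data/ dockerswarm \
#       "$(ceph-authtool /etc/ceph/keyring.dockerswarm -p -n client.dockerswarm)"
```

Passing a comma-separated monitor list as the first argument gives a line that mounts from several nodes rather than only the local one.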

Note: There are also some recommendations in this post to mount ceph from multiple nodes, not just the local node.

All 3 hosts should now have a /var/data directory and files that are created on one should appear automatically on the others.

For my use-case so far, this is sufficient. I’m using files/directories within /var/data as bind mounts (not volumes) in my docker containers currently and it seems to be working. I’m planning on playing about more with this in the coming weeks to see how well it works with more real-world usage.