Docker Swarm Cluster Improvements
This post is part of a series.
- Docker Swarm with Ceph for cross-server files
- Upgrading Ceph in Docker Swarm
- Docker Swarm Cluster Improvements (This Post)
Since my previous posts about running docker-swarm with ceph, I’ve been using this setup fairly extensively in production and have made some changes that follow on from those posts.
1. Run ceph using docker-compose
The first main change was to start running ceph with docker-compose on the host nodes.
The main reason for this is to save me needing to look up the docker run commands if I want to recreate the containers (eg for updates).
Firstly, switch ceph into maintenance mode:
ceph osd set noout
And then stop and remove the old containers:
docker stop ceph-mds; docker stop ceph-osd; docker stop ceph-mon; docker stop ceph-mgr;
docker rm ceph-mds; docker rm ceph-osd; docker rm ceph-mon; docker rm ceph-mgr;
Install docker-compose following the installation guide, which looks something like this at the time of writing:
sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
mv /usr/local/bin/docker-compose /usr/bin/docker-compose
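A quick sanity check that the install worked:
docker-compose --version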
And then create a new directory /root/ceph/ with a docker-compose.yml file inside that looks something like this:
---
version: '3.9'
x-ceph-default: &ceph-default
  image: 'ceph/daemon:latest-nautilus'
  restart: always
  network_mode: host
  pid: host
  volumes:
    - '/var/lib/ceph/:/var/lib/ceph/'
    - '/etc/ceph:/etc/ceph'

services:
  mds:
    << : *ceph-default
    command: mds
    container_name: ceph-mds
    environment:
      - CEPHFS_CREATE=1
      - CEPHFS_DATA_POOL_PG=128
      - CEPHFS_METADATA_POOL_PG=128
  osd-sdb:
    << : *ceph-default
    command: osd
    container_name: ceph-osd-sdb
    privileged: true
    volumes:
      - '/var/lib/ceph/:/var/lib/ceph/'
      - '/etc/ceph:/etc/ceph'
      - '/dev/:/dev/'
    environment:
      - OSD_DEVICE=/dev/sdb
      - OSD_TYPE=disk
  mgr:
    << : *ceph-default
    command: mgr
    container_name: ceph-mgr
    privileged: true
  mon:
    << : *ceph-default
    command: mon
    container_name: ceph-mon
    environment:
      - MON_IP=<MON IP>
      - CEPH_PUBLIC_NETWORK=<PUBLIC NETWORK>
<MON IP> should be replaced with the output from ip addr show dev eth0 | grep "inet " | head -n 1 | awk '{print $2}' | awk -F/ '{print $1}'
<PUBLIC NETWORK> should be replaced with the output from ip route show dev eth0 | grep link | grep -v 169.254.0.0 | awk '{print $1}'
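As a made-up example, on a node whose eth0 address is 192.0.2.5/24 these would give MON_IP=192.0.2.5 and CEPH_PUBLIC_NETWORK=192.0.2.0/24.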
Then the ceph containers can be restarted using:
docker-compose up -d
This should be done on each node one at a time.
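Before moving on to the next node it's worth waiting for ceph status to look healthy again, and once every node has been switched over the noout flag from earlier can be cleared:
ceph status
ceph osd unset noout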
Running ceph via docker-compose makes updating easier, as we can now just change the ceph-default section and then stop/start the containers. Eg the process to upgrade to octopus on each node:
Firstly, edit docker-compose.yml and change the image to be ceph/daemon:latest-octopus. Then on each node we can run docker-compose pull to pull down the new image, and we can run through the upgrade process, which is similar to how we did it last time but this time we don’t need to remember the right options for docker run.
Start by setting noout:
ceph osd set noout
On each node, one at a time, restart the mon containers:
docker-compose stop mon; docker-compose up -d mon
and mgr:
docker-compose stop mgr; docker-compose up -d mgr
and osd:
docker-compose stop osd-sdb; docker-compose up -d osd-sdb
(As before, you want to wait until ceph osd versions shows the new osd coming back and ceph status looks happy before moving on)
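One convenient way to keep an eye on this is:
watch -n 10 'ceph osd versions && ceph status'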
Once all 3 are done, we can enable octopus-only features:
ceph osd require-osd-release octopus
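This can be double-checked with ceph osd dump | grep require_osd_release, which should now show octopus.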
Now the mds containers are a bit different.
Firstly we need to change max_mds to 1 if it’s not already (you can check using ceph fs get cephfs | grep max_mds):
ceph fs set cephfs max_mds 1
Now we should stop all the non-active MDSs. We can see the currently active MDS using ceph status | grep -i mds, and on the standby nodes we can do:
docker-compose stop mds;
Then we can restart the active mds:
docker-compose stop mds; docker-compose up -d mds
And once it appears as active within ceph status we can restart the standbys:
docker-compose up -d mds
At this point, the max_mds value can be reset if it was previously anything other than 1.
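For example, if it was 2 beforehand:
ceph fs set cephfs max_mds 2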
Now unset the noout flag:
ceph osd unset noout
We can also update our crushmap to straw2:
ceph osd getcrushmap -o backup-crushmap
ceph osd crush set-all-straw-buckets-to-straw2
(This creates a backup that we can restore if needed with ceph osd setcrushmap -i backup-crushmap)
And fix the insecure global_id reclaim warning:
ceph config set mon auth_allow_insecure_global_id_reclaim false
After making this change, our host node version of ceph may no longer be able to talk to the cluster, but this should be easily resolved by running yum update ceph.
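A quick way to confirm the host client is back in step with the cluster is to compare ceph --version (the local client) against ceph versions (what the cluster daemons are running).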
The upgrade process from octopus to pacific is much the same, up to the point where we run ceph osd unset noout; after that there are no post-upgrade cleanups needed.
2. Run keepalived via swarm
This is somewhat of a quality-of-life change to ensure that drained nodes don’t have keepalived running.
I didn’t previously document setting up keepalived on these nodes, but I’ve now switched from running it outside of swarm to running it inside swarm.
A docker-compose.yml file similar to this:
---
version: '3.7'
x-defaults: &defaults
  image: osixia/keepalived:2.0.20
  cap_add:
    - NET_ADMIN
  networks:
    - host
  volumes:
    - /var/data/composefiles/keepalived/fixPriority.sh:/container/run/startup/000-fixPriority.sh
  deploy:
    mode: global
    restart_policy:
      condition: any

services:
  v4:
    << : *defaults
    environment:
      - "KEEPALIVED_VIRTUAL_IPS=#PYTHON2BASH:['<IPV4 IP>/29']"
      - KEEPALIVED_UNICAST_PEERS=
      - KEEPALIVED_ROUTER_ID=204
  v6:
    << : *defaults
    environment:
      - "KEEPALIVED_VIRTUAL_IPS=#PYTHON2BASH:['<IPV6 IP>/64']"
      - KEEPALIVED_UNICAST_PEERS=
      - KEEPALIVED_ROUTER_ID=206

networks:
  host:
    external: true
<IPV4 IP> and <IPV6 IP> are the VIPs we want to use. With /var/data/composefiles/keepalived/fixPriority.sh looking like:
#!/bin/sh
PRIORITY_FROM_IP=$((255 - $(ip addr show dev ${KEEPALIVED_INTERFACE-eth0} | grep "inet " | head -n 1 | awk '{print $2}' | awk -F/ '{print $1}' | awk -F. '{print $4}')))
if [ "${PRIORITY_FROM_IP}" != "" ]; then
echo ${PRIORITY_FROM_IP} > /container/run/environment/KEEPALIVED_PRIORITY
fi;
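All this does is derive a unique keepalived priority from the last octet of the node's eth0 address; eg (with a made-up address) a node on 192.0.2.12 would get a priority of 255 - 12 = 243.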
The priorities are therefore set based on the IPs of the host nodes, and the whole thing can then be run like any other swarm service.
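For example, assuming we call the stack keepalived (the name here is just my choice):
docker stack deploy -c /var/data/composefiles/keepalived/docker-compose.yml keepalived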
We need to make sure that modprobe ip_vs is run at startup; the easiest way is using /etc/rc.d/rc.local. On my clusters this looks something like:
touch /var/lock/subsys/local
modprobe ip_vs
sed -i '/'$(hostname)'/d' /etc/hosts
mount /var/data
And then chmod a+x /etc/rc.d/rc.local.
This also ensures our /var/data ceph mount is mounted, and removes the pointer to 127.0.0.1 for our hostname (which breaks our ceph mounting as we’re using our public IPs).
3. Helper scripts
All my docker-compose.yml files for my different stacks/services live under /var/data/composefiles/, in separate folders per stack.
To make (re-)running and debugging these easier, I have a helper script loaded into the bash profile of my host nodes, which gives me a few useful commands:
- runstack and stopstack in a directory under /var/data/composefiles/ will start/stop the stack without needing to use the full docker stack deploy ... command (a minimal sketch of these follows this list).
- drain and undrain on a node will drain it and pause/unpause ceph as appropriate to allow for updates/reboots.
- servicelogs looks at the logs for a specific running instance of a service (because the docker service logs command is weird and mixes the logs from different nodes).
- serviceexec jumps into a specific running instance of a container from any node (eg serviceexec keepalived_v4 0 bash to jump into the first running instance).
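As a rough sketch of what the first two look like (assuming the stack is simply named after its folder; my real script has a bit more to it):
runstack() {
  docker stack deploy --compose-file docker-compose.yml "$(basename "$PWD")"
}
stopstack() {
  docker stack rm "$(basename "$PWD")"
}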
I have this script at /var/data/.bash_common, which gets synced over to /root/.bash_common periodically and is then loaded into the shell by adding:
if [ -f /root/.bash_common ]; then
. /root/.bash_common
fi;
to the bottom of /root/.bashrc. (I used to load it directly from /var/data/.bash_common, but this breaks the ability to log in as root easily if there are issues with the ceph volume.)
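One simple way to handle that periodic sync (cron here is an assumption; any scheduler would do) is a root crontab entry along the lines of:
*/30 * * * * cp /var/data/.bash_common /root/.bash_common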