Fun with Dell S4048 and ONIE

In $DayJob we make use of Dell S4048-ON Switches for 10G Top-of-Rack (ToR) switching and also sometimes 10G Aggregation/Core for smaller deployments. They’re fairly flexible devices with a high number of 10G ports, some 40G ports, and they can do both L3 and L2 ports. You can also run them either Stacked or in VLT mode for redundancy purposes.

In addition these things use ONIE (Open Network Install Environment) and can run different firmware images - though we almost exclusively run these with DNOS 9 (the Force10 FTOS code that Dell acquired some time ago) rather than DNOS 10.

One evening, I was tasked with an “emergency” build request. We had some kit being shipped to a remote PoP the following day and the intended routers were delayed, so we needed to get something quickly and temporarily in place to take a BGP Transit Feed and deliver VRRP to the rest of the kit. A spare S4048 we had lying around would do the job sufficiently for the time period needed. I figured it wouldn’t take too long to get the base config needed and get it ready to be shipped with the rest of the kit.

So I got the Datacenter to rack/cable/console it so that I could begin configuration then set aside some time in the evening to do the work.

As I was watching the switch boot up I noticed something odd. Turns out the last engineer who had used this device had chosen to install the OpenSwitch OPX ONIE firmware on it instead of the usual DNOS9 firmware. So much for my quick and easy config.

At this point, I could have just reloaded the device into the ONIE installer environment, installed DNOS9 and been done with it all. But I had a fairly open evening, and I’d not yet really played about much with any of the alternative ONIE OSes, so armed with my Yak Shears, I thought I’d have a look around.

(After all this, I then re-imaged the device onto our standard deployment image of DNOS9 and completed the required config work that I was supposed to be doing.)
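
For the curious, the re-imaging step itself is pretty painless with ONIE: you boot the switch into the ONIE install environment and point it at an installer image, something along the lines of the following (the server and image name here are purely illustrative):

onie-nos-install http://192.0.2.10/images/FTOS-SE-9.14.2.6.bin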

I found the OpenSwitch OPX Configuration Guide and started having a read.

TL;DR: It’s a Debian box, use ip and /etc/network/interfaces to configure it.

So I added an IP address to one of the interfaces (e101-001-0 for the first 10G interface on the device) and some default routing, and brought up the link, something like:

ip addr add 192.0.2.2/30 dev e101-001-0
ip route add 0.0.0.0/0 via 192.0.2.1
ip link set dev e101-001-0 up
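
Had this been anything more than a quick temporary poke around, the same config could have been made persistent in /etc/network/interfaces in the usual Debian fashion - a rough sketch, untested on OPX:

cat >> /etc/network/interfaces <<'EOF'
auto e101-001-0
iface e101-001-0 inet static
    address 192.0.2.2
    netmask 255.255.255.252
    gateway 192.0.2.1
EOF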

And lo and behold, my switch now had internet access.

admin@OPX:~$ ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=119 time=1.31 ms
^C
--- 8.8.8.8 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1.313/1.313/1.313/0.000 ms
admin@OPX:~$

Now I could ssh to it and have a look around.

Logging in drops you into a fairly standard Debian shell and we can learn a bit about the device:

admin@OPX:~$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 77
Model name:            Intel(R) Atom(TM) CPU  C2338  @ 1.74GHz
Stepping:              8
CPU MHz:               1750.071
BogoMIPS:              3500.14
Virtualization:        VT-x
L1d cache:             24K
L1i cache:             32K
L2 cache:              1024K
NUMA node0 CPU(s):     0,1
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer aes rdrand lahf_lm 3dnowprefetch epb kaiser tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms dtherm arat
admin@OPX:~$
admin@OPX:~$ free -m
              total        used        free      shared  buff/cache   available
Mem:           3937         516        2460          13         961        3189
Swap:             0           0           0
admin@OPX:~$
admin@OPX:~$ df -h
Filesystem                Size  Used Avail Use% Mounted on
udev                      2.0G     0  2.0G   0% /dev
tmpfs                     394M   14M  381M   4% /run
/dev/mapper/OPX-SYSROOT1  6.8G  1.7G  4.8G  26% /
tmpfs                     2.0G     0  2.0G   0% /dev/shm
tmpfs                     5.0M     0  5.0M   0% /run/lock
tmpfs                     2.0G     0  2.0G   0% /sys/fs/cgroup
/dev/sda4                 6.8M  2.0M  4.2M  33% /mnt/boot
/dev/sda2                 120M   13M   99M  12% /mnt/onie-boot
admin@OPX:~$

It’s got a fairly weak Atom CPU and 4G of RAM - approximately the same as what you’d get in a cheap £10/month VPS. Disk space is basically non-existent at less than 5GB.

Nothing to write home about here, but that’s OK - this is just the management plane, it doesn’t need to be performant. In fact, I’d be disappointed if it was, as it would be a waste in a device like this.

Let’s have a look around some more with OPX and see what we can see.

There are a whole bunch of opx- prefixed commands to interact with the hardware:

root@OPX:~# opx-show-
opx-show-alms             opx-show-interface        opx-show-log              opx-show-packages         opx-show-stats            opx-show-transceivers     opx-show-vrf
opx-show-env              opx-show-interface-stats  opx-show-mac              opx-show-route            opx-show-system-status    opx-show-version
opx-show-global-switch    opx-show-lag              opx-show-mirror           opx-show-sflow            opx-show-transceiver      opx-show-vlan
root@OPX:~# opx-config-
opx-config-beacon         opx-config-global-switch  opx-config-interface      opx-config-log            opx-config-mirror         opx-config-sflow          opx-config-vlan           opx-config-vxlan.py
opx-config-fanout         opx-config-hybrid-group   opx-config-lag            opx-config-mac            opx-config-route          opx-config-switch         opx-config-vrf
root@OPX:~#

The output of these seems reasonably friendly and usable:

root@OPX:~# opx-show-version
OS_NAME="OPX"
OS_VERSION="3.1.0"
PLATFORM="S4048-ON"
ARCHITECTURE="x86_64"
INTERNAL_BUILD_ID="OpenSwitch blueprint for Dell 1.0.0"
BUILD_VERSION="3.1.0.0-rc1"
BUILD_DATE="2018-12-19T12:31:44-0800"
INSTALL_DATE="2019-11-21T16:38:13+00:00"
SYSTEM_UPTIME= 28 minutes
SYSTEM_STATE= running


UPGRADED_PACKAGES=no
ALTERED_PACKAGES=no
root@OPX:~#
root@OPX:~# opx-show-transceiver
Port 1
    Present:            yes
    Type:               SFP+ 10GBASE-SR
    Vendor:             FS
    Vendor part number: SFP-10GSR-85
    Vendor revision:    0000
    Serial number:      G1234567890
    Qualified:          yes
    Temperature:        31.0 deg. C
    Temperature state:  nominal
    Voltage:            3.29099988937 V
    Voltage state:      nominal
    High power mode:    no
Port 2
    Present:            yes
    ...
Port 52
    Present:            yes
    Type:               QSFP+ 40GBASE-CR4-1.0M
    Vendor:             FS
    Vendor part number: QSFP-PC005
    Vendor revision:    4100
    Serial number:      C1234567890-1
    Qualified:          yes
    Temperature:        0.0 deg. C
    Temperature state:  nominal
    Voltage:            0.0 V
    Voltage state:      nominal
    High power mode:    yes
    ...
root@OPX:~#
root@OPX:~# opx-show-transceiver --port 1
Port 1
    Present:            yes
    Type:               SFP+ 10GBASE-SR
    Vendor:             FS
    Vendor part number: SFP-10GSR-85
    Vendor revision:    0000
    Serial number:      G1234567890
    Qualified:          yes
    Temperature:        31.0 deg. C
    Temperature state:  nominal
    Voltage:            3.29099988937 V
    Voltage state:      nominal
    High power mode:    no
root@OPX:~#
root@OPX:~# opx-ethtool e101-001-0
Settings for e101-001-0:
    Channel ID:   0
    Transceiver Status: Enable
    Media Type: SFP+ 10GBASE-SR
    Part Number: SFP-10GSR-85
    Serial Number: G1234567890
    Qualified: Yes
    Administrative State: UP
    Operational State: UP
    Supported Speed (in Mbps):  [1000, 10000]
    Auto Negotiation : off
    Configured Speed   : 10000
    Operating Speed   : False
    Duplex   : full
root@OPX:~#
root@OPX:~# opx-ethtool -e e101-001-0
Show media info for e101-001-0
...
base-pas/media/port-type = 1
base-pas/media/wavelength-pico-meters = 850000
...
base-pas/media/slot = 1
base-pas/media/port = 1
...
base-pas/media/category-string = SFP+
base-pas/media/capability = 4
base-pas/media/diag-mon-type = 104
base-pas/media/channel-count = 1
base-pas/media/type = 5
...
base-pas/media/tx-power-low-warning-threshold = -7.99970722198
base-pas/media/insertion-timestamp = 140016931634256
...
base-pas/media/display-string = SFP+ 10GBASE-SR
base-pas/media/vendor-pn = SFP-10GSR-85
base-pas/media/current-temperature = 31.0
...
root@OPX:~#
root@OPX:~# opx-show-env
Chassis
...
        Vendor name:            DELL
        Service tag:            xxxxxxx
        PPID:                           xxxxxxxxxxxxxxxxxxxx
        Platform name:
        Product name:                   S4048ON
        Hardware version:               A02
        Number of MAC addresses:        256
        Base MAC address:               00:11:22:33:44:55
Power supplies
        Slot 1
                Present:                Yes
                Operating status:       Up
                Fault type:             OK
                Vendor name:
                Service tag:            AEIOU##
                PPID:                   xxxxxxxxxxxxxxxxxxxx
                Platform name:
                Product name:
                Hardware version:               A00
                Input:                  AC
                Fan airflow:            Reverse
        Slot 2
                ...
Fan trays
        Slot 1
                Present:                Yes
                Operating status:       Up
                Fault type:             OK
                Vendor name:
                Service tag:            AEIOU##
                PPID:                   xxxxxxxxxxxxxxxxxxxx
                Platform name:
                Product name:
                Hardware version:               A00
                Fan airflow:            Reverse
        Slot 2
                ...
        Slot 3
                ...
Fans
        Fan 1, PSU slot 1
                Operating status:       Up
                Fault type:             OK
                Speed (RPM):            10320
                Speed (%):              57
        Fan 1, PSU slot 2
                ...
        Fan 1, Fan tray slot 1
                Operating status:       Up
                Fault type:             OK
                Speed (RPM):            10121
                Speed (%):              53
        Fan 2, Fan tray slot 1
                ...
        Fan 1, Fan tray slot 2
                ...
        Fan 2, Fan tray slot 2
                ...
        Fan 1, Fan tray slot 3
                ...
        Fan 2, Fan tray slot 3
                ...
Temperature sensors
        Sensor CPU board sensor, Card slot 1
                Operating status:               Up
                Fault type:                     OK
                Temperature (degrees C):        31
        Sensor NPU board sensor, Card slot 1
                Operating status:               Up
                Fault type:                     OK
                Temperature (degrees C):        35
        Sensor system-NIC board sensor 1, Card slot 1
                Operating status:               Up
                Fault type:                     OK
                Temperature (degrees C):        33
        Sensor system-NIC board sensor 2, Card slot 1
                Operating status:               Up
                Fault type:                     OK
                Temperature (degrees C):        31
        Sensor NPU temp sensor, Card slot 1
                Operating status:               Up
                Fault type:                     OK
                Temperature (degrees C):        48
root@OPX:~#

OK, so we’ve got basic connectivity, but what if we wanted to do more, like BGP?

The configuration guide says:

Use the apt-get install command to install the latest Debian 9 (stretch) release of the FRR package.

apt you say…

The guide suggested installing the .deb by hand, but I figured it would probably work properly via apt:

apt-get update
apt-get install apt-transport-https
curl -s https://deb.frrouting.org/frr/keys.asc | sudo apt-key add -
export FRRVER="frr-stable"
echo deb https://deb.frrouting.org/frr stretch $FRRVER | sudo tee -a /etc/apt/sources.list.d/frr.list
apt-get update
apt-get install frr

And it actually installed.

At this point, a normal network-person would probably have continued looking at FRR and getting it working (I’m sure it works reasonably well, I didn’t look).
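
For completeness, the usual first steps with FRR would be something like the below - a sketch only, untested here, with made-up AS numbers and a made-up neighbor address:

sed -i 's/^bgpd=no/bgpd=yes/' /etc/frr/daemons
systemctl restart frr
vtysh -c 'conf t' -c 'router bgp 65000' -c 'neighbor 192.0.2.1 remote-as 65001' -c 'end' -c 'write memory'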

I’m not a normal network-person. I also like to play about with servers as well.

So armed with the knowledge that apt worked… I decided to try installing docker… because of course that’s the next thing you try to install on a network switch.

curl -fsSL https://download.docker.com/linux/debian/gpg | sudo apt-key add -
echo "deb [arch=amd64] https://download.docker.com/linux/debian stretch stable" | sudo tee -a /etc/apt/sources.list.d/docker.list
apt-get update
apt-get install docker-ce docker-ce-cli containerd.io

And it worked. Docker was installed. And seemingly working.

root@OPX:~# docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
root@OPX:~#

So the next obvious thing, what can I run to test this?

How about… this blog?

root@OPX:~# docker run shanemcc/blog.dataforce.org.uk
Unable to find image 'shanemcc/blog.dataforce.org.uk:latest' locally
latest: Pulling from shanemcc/blog.dataforce.org.uk
cbdbe7a5bc2a: Pull complete
c554c602ff32: Pull complete
eda7f6504221: Pull complete
08afec60697d: Pull complete
Digest: sha256:fd3c2e1d0a8ab6e9af30f4293135cffa2dba644aded797fe79188307f2ae0a2d
Status: Downloaded newer image for shanemcc/blog.dataforce.org.uk:latest

Well, it seemed to be running:

root@OPX:~# docker ps
CONTAINER ID        IMAGE                            COMMAND                  CREATED             STATUS              PORTS               NAMES
02399b6f09b9        shanemcc/blog.dataforce.org.uk   "nginx -g 'daemon of…"   55 seconds ago      Up 53 seconds       80/tcp              pensive_kapitsa
root@OPX:~#

But didn’t seem to actually work. Maybe it was too good to be true?

Oh wait - the networking on this is probably a bit weird, maybe the docker bridge/NAT stuff doesn’t work… What if we try host-based networking?

root@OPX:~# docker run --rm --network host --name shaneblogtest shanemcc/blog.dataforce.org.uk
192.0.2.253 - - [29/Oct/2020:20:09:49 +0000] "GET / HTTP/1.1" 200 32706 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.59 Safari/537.36" "-"
192.0.2.253 - - [29/Oct/2020:20:09:49 +0000] "GET /css/allStyles-b2de97faf57b5af84d20b6bbcd1f47ab.css HTTP/1.1" 200 25159 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.59 Safari/537.36" "-"
192.0.2.253 - - [29/Oct/2020:20:09:49 +0000] "GET /wp-content/uploads/2016/05/header.png HTTP/1.1" 200 7938 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.59 Safari/537.36" "-"
192.0.2.253 - - [29/Oct/2020:20:09:49 +0000] "GET /wp-content/uploads/2016/05/ShaneNewColour.png HTTP/1.1" 200 5866 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.59 Safari/537.36" "-"
...

That worked, and then I was able to see this blog in all its wonder, served from a switch!

Blog running on a switch

(Some of you will note that I didn’t actually expose a port properly in the first command, so it may well have worked if I’d done it correctly - I didn’t try any further.)
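
Presumably publishing the port properly would just have been a case of something like this (untested):

docker run --rm -p 80:80 --name shaneblogtest shanemcc/blog.dataforce.org.uk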

I was greatly amused at the idea of this, mainly because it’s so stupid (running the blog on a £3k Switch that’s no more powerful than a £10/month VPS).

But also thinking about it more, this is quite exciting.

ONIE/OPX can run on x86 hardware or in a VM with KVM/QEMU/Vagrant etc., so you can actually have local test environments that function similarly to your live production switches, and with docker you can run applications on these devices to handle configuration/automation etc. and get all the advantages of a modern development pipeline with reproducible builds and an easy installation process (docker run ...).
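
As a trivial illustration of that, something like a metrics exporter could be dropped onto each switch with a one-liner (prom/node-exporter is just an example image here - I haven’t actually tried it on OPX):

docker run -d --restart always --net host --name node-exporter prom/node-exporter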

Or you could run a blog. ¯\_(ツ)_/¯

Cisco XConnect L2Protocol Handling

In $DayJob we make fairly extensive use of MPLS ATOM Pseudowires (XConnects) between our various datacenter locations to enable services in different sites to talk to each other at layer2.

The way I describe this to customers is that in essence these act as a “long cable” from Point-A to Point-B. The customer gets a cable at each side to connect to their kit, but in the middle of it there is magic that routes the packets over our network rather than an actual long-cable. Packets that enter 1 side will be pushed out the other side, and vice-versa. We don’t need to know or care what these packets are, we are just transparently transporting them.

As a quick primer, imagine the following network:

Sample Base Network

This fictional network has 2 main sites, York and Manchester, and 2 smaller sites at Leeds and Birmingham. There are 4 individual L2 circuits between the sites forming a ring, and the routers are MPLS capable and configured appropriately.

A new customer with a presence in each site wants layer-2 connectivity between their devices. In the past, if we had connected switches at each location we may have provided spanned-VLANs (with QinQ) through the sites, but now we can provide this using MPLS XConnects, which will be transparent to the customer. We provision 2 of these for redundancy on different devices at each side, and we end up with something like this:

Sample Network With XConnects

The customer has 2 services, Green and Blue, and they are able to connect their switches to them and everything works as if the 2 devices were directly connected. The customer is unaware of the Leeds/Birmingham devices as the provider network is transparent and everything including things such as CDP/LLDP/STP/LACP are all happily transported from site to site. The customer doesn’t see our network, and can treat these 2 cables as they see fit (such as running LACP over the top). The customer is happy.

Back at $DayJob we use a mixture of devices to do this depending on the age of the site and how long the services have been in place for.

In our case a number of these are provisioned between pairs of Cisco 7600 devices, although as we have been phasing these out we have been moving towards using ASR920s instead for newer connections. As we deploy these and phase out the 7600s, we normally provide customers with new XConnects on the new ASR920s, move them across, and then remove the old ones. This results in most of these XConnects being between devices of the same family. We have some cross-family (920 to 7600) XConnects, but these are few and far between and we had never really noticed any issues with them.

However, one day a few days after some emergency maintenance work to decommission a failing 7600 device and move the XConnect services on it onto an ASR920, I started to notice some of our transcontinental links had developed an unusual and unexpected traffic pattern. A link that was normally fairly quiet in Asia started looking like this:

Christmas Tree Network

Traffic would slowly creep up and up and up, then reset a bit, then keep going. Different links were seeing different levels of traffic, but over time they would all start to fill up, getting closer and closer to maxing out if left alone.

Looking at the various links that had developed this pattern, I was able to narrow down which customer network was having the issue, and noticed that it had started around the time we had replaced the 7600. I realised it must be related to the maintenance work, and discovered specifically that there were ports with XConnect configs on them from the new ASR920 to remote 7600s. If I shut down one of the ports, the traffic completely vanished. And then started again once it was unshut. (As seen on the above graph.)

Looking more at this - while at first glance the XConnects appeared to behave fine (they were all showing up, and traffic was clearly passing across them), there was a subtle underlying problem: on these cross-family XConnects, certain important L2Protocols (such as CDP, LACP and Spanning-Tree BPDUs) were behaving unidirectionally.

What we were seeing was that these L2Protocol packets, when sent from devices at the 7600 side, were successfully reaching devices at the ASR920 side, but were not successfully transiting in the other direction.

So much for my transparent “really long cable”.

Without these important packets working in all directions, we had ended up with a network loop on this customer platform and a broadcast storm that was going all the way round our global network from London to Tokyo to San Francisco to Virginia and back to London.

Thankfully, because the loop was going the long way round, the speed of light was able to prevent the storm growing too quickly, resulting in the fun traffic graphs. Unfortunately, shutting down a port every 12 hours is not a solution, and in this case I didn’t have the option of converting all of these XConnects into same-family XConnects due to the availability of ASR920 ports in the various sites - so we needed to get to the bottom of exactly what was happening.

I got some kit together and started to lab it up: a couple of switches, an old 7600 and an ASR920, with MPLS between the 7600 and ASR920, and then a simple XConnect built between the 2 switches:

Lab Setup

This was able to reproduce the issue quite nicely. One side could see the other over CDP, the other side could not.

So now that I could reproduce it, I started to look into more details about the differences between the 2 devices. We’re doing simple whole-port based XConnects here so the config is fairly straightforward.

On the 7600, we have something like:

interface GigabitEthernet1/1
  mtu 9216
  no ip address
  no keepalive
  xconnect 10.255.0.2 100 encapsulation mpls
!

Nice and simple. We set the MTU on the port to 9216 (to allow us to receive and transport full 1500 and 9000 byte frames), and tell it to set up an XConnect to the other device with the circuit ID of 100 and encapsulate this via the MPLS network.

On the ASR920, we have something like:

interface GigabitEthernet0/0/1
  mtu 9216
  service instance 1 ethernet
    encapsulation default
    l2protocol tunnel
    xconnect 10.255.0.1 100 encapsulation mpls
  !
!

As you can see, there is a little bit more to this config, but this is mostly due to how this product is designed to be used.

We’re defining here a service instance and then using encapsulation default to tell the router that it should use this for any traffic that is not matched by any other service instance on this port (we can have other service instance blocks that match different types of traffic, e.g. encapsulation untagged for all non-VLAN traffic, or encapsulation dot1q 1234 to match traffic tagged with VLAN 1234, etc.). We’re also specifying the handling of l2protocol traffic here, telling the device that we want to tunnel it to the other side - similar to the l2protocol-tunnel command on older switches when doing QinQ.

So these 2 config blocks in isolation seem fine, and when paired with an identical configuration at the other side - everything works as expected. Alas when paired with each other, they do not.

So, blinkers on based on the fact this config worked between devices of the same type, I started looking into this and attempting to make it work.

I tried changing the firmware on the ASR920s in case there was some issue with different versions. We’d not noticed problems before, and definitely had some cross-family XConnects from the early deployments before we had ASR920s in more of our sites, so maybe something had broken at some point. Seemed reasonable.

I tried both older and newer versions of the firmware. Nope. The problem persisted.

Newer versions are slightly more verbose about their l2protocol command, and will display the config something like: l2protocol tunnel cdp stp vtp pagp dot1x lldp lacp udld - but this doesn’t seem to actually change anything.

So I then tried a variety of different ways of building the XConnects. The 7600 side was pretty set-in-stone, but the ASR920 has a few other ways we could try:

interface GigabitEthernet0/0/1
  mtu 9216
!
l2vpn xconnect context 100
  member GigabitEthernet0/0/1
  member 10.255.0.1 100 encapsulation mpls
!

or:

interface GigabitEthernet0/0/1
  mtu 9216
!
l2vpn xconnect context 100
  member GigabitEthernet0/0/1
  member Pseudowire100
!
interface Pseudowire 100
  encapsulation mpls
  neighbor 10.255.0.1 100
!

Nope. These options didn’t behave either - I also didn’t really like them, as they split the config up too much in the show running-config output - you can’t easily see that Gi0/0/1 is being used as an XConnect just from looking at it.

So I went back to the original config.

Given that L2Protocol traffic entering the ASR920 was what wasn’t working, and this was the only side where we were specifically calling out the l2protocol handling, I started looking more at that.

Removing that line didn’t help - it made things worse between 2 ASR920s, as no l2protocol traffic passed at all - so it was definitely required. So I looked at other options for this command. As it happens, tunnel is not the only option here on the ASR920s; we also have drop, forward and peer.

Neither drop nor peer seemed useful, so I changed my config from tunnel to forward and suddenly everything started behaving.

So now my ASR920 config looked like:

interface GigabitEthernet0/0/1
  mtu 9216
  service instance 1 ethernet
    encapsulation default
    l2protocol forward
    xconnect 10.255.0.1 100 encapsulation mpls
  !
!

And my 2 switches were finally able to speak CDP to each other and exchange STP BPDUs.

Turns out what was happening was that the 7600 side just forwards the l2protocols on to the remote side without doing anything with them, however on the ASR920 side you have to specify what to do with them. We had used l2protocol tunnel on these as this was similar to the l2protocol-tunnel command we had used on older devices doing QinQ, which were the first ones we replaced with ASR920s (in pairs) - this worked fine and thus became part of our standard config.

So with this config, the 7600 would receive l2protocol packets and forward them on to the ASR920. When the ASR920 received the forwarded l2protocol packets from the 7600, it happily passed them out the port and everything worked as expected. However, when the ASR920 received them inbound, it modified and encapsulated them for tunneling before forwarding them on to its partner (this would be necessary if we weren’t doing MPLS and the link between our 2 devices was a switched L2 network, to stop them being processed there). If this partner was another ASR920 configured the same way, it would be expecting this and would de-encapsulate them before forwarding them on, and everything worked fine. However, the 7600 was not expecting them in this format and just forwarded them on as-is, so the customer devices didn’t understand what they were seeing and ignored them. Changing to l2protocol forward causes the ASR920s to behave in the same way as the 7600s, and everything is happy.

Having figured this out, I went back to the recently-replaced ASR920 and dutifully changed the config on any of the xconnects that were facing 7600s to be l2protocol forward - and lo and behold, my Christmas-tree graphs immediately ceased across the board.

So why hadn’t we seen this before despite having cross-family XConnects elsewhere? Looking at the limited instances we had of this, I think we just got lucky. Either the customer only had a single XConnect, or wasn’t using them for switches, or it happened to be unidirectional in the right way and STP blocked the port correctly.

Goes to show that just because something looks like it might be working fine doesn’t always mean it is - there may still be subtle parts of it that are not.

Thankfully I was able to get to the bottom of this one.

Upgrading Ceph in Docker Swarm

This post is a followup to an earlier blog post regarding setting up a docker-swarm cluster with ceph.

I’ve been running this cluster quite happily for a while now, however since setting it up a new version of ceph has been released - nautilus - so now it’s time for some upgrades.

I’ve mostly followed https://docs.ceph.com/docs/master/releases/nautilus/#upgrading-from-mimic-or-luminous but adapted it for the fact we’re running everything in docker. I recommend that you have a read through this yourself first to have an idea of what we are doing and why.

(It’s worth noting at this point that this guide was mostly written after the fact based on command history so I may have missed something. It’s always a good idea to do this on a test cluster first, or in a maintenance window!)

Before we begin the upgrade, we should run the following on each node in advance to save time later: docker pull ceph/daemon:latest-nautilus

Now we can prepare to update. Firstly on any node we tell ceph not to worry about rebalancing:

ceph osd set noout

Now we can begin actually upgrading ceph. The process is quite simple for each daemon type: on each node we stop and remove the old container, then start a new one with the same flags we used in the past. So here we go:

On each node, one at a time, restart the ceph-mon containers:

docker stop ceph-mon; docker rm ceph-mon
docker run -d --net=host --restart always -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ \
-e MON_IP=$(ip addr show dev eth0 | grep "inet " | head -n 1 | awk '{print $2}' | awk -F/ '{print $1}') \
-e CEPH_PUBLIC_NETWORK=$(ip route show dev eth0 | grep link | grep -v 169.254.0.0 | awk '{print $1}') \
--name="ceph-mon" ceph/daemon:latest-nautilus mon

(This is basically the same command that was used before, except we’re now specifying that we want to use ceph/daemon:latest-nautilus as the image source)

After we have done this, we can check that the upgrade was successful:

ceph mon versions should show something like:

{
    "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 3
}

Now the same for the mgr containers:

docker stop ceph-mgr; docker rm ceph-mgr
docker run -d --net=host --privileged=true --pid=host -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ --name="ceph-mgr" --restart=always ceph/daemon:latest-nautilus mgr

Checking with ceph mgr versions

And the osd containers:

docker stop ceph-osd; docker rm ceph-osd
docker run -d --net=host --privileged=true --pid=host -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ -v /dev/:/dev/ -e OSD_DEVICE=/dev/sdb -e OSD_TYPE=disk --name="ceph-osd" --restart=always ceph/daemon:latest-nautilus osd

Checking with ceph osd versions (You might want to wait for the output of this command to show that the current node is running the new version before moving on to the next node)

Now we can move onto the MDS containers.

Firstly we need to change max_mds to 1 if it’s not already (You can check using ceph fs get cephfs):

ceph fs set cephfs max_mds 1

Now we should stop all the non-active MDSs. We can see the currently active MDS using: ceph status | grep -i mds

And we stop the non-active standby MDSs using:

docker stop ceph-mds; docker rm ceph-mds

And then once ceph status shows only the active MDS, we can restart the remaining one:

docker stop ceph-mds; docker rm ceph-mds
docker run -d --net=host --name ceph-mds --restart always -v /var/lib/ceph/:/var/lib/ceph/ -v /etc/ceph:/etc/ceph -e CEPHFS_CREATE=1 -e CEPHFS_DATA_POOL_PG=128 -e CEPHFS_METADATA_POOL_PG=128 ceph/daemon:latest-nautilus mds

And then restart all the standby MDSs:

docker run -d --net=host --name ceph-mds --restart always -v /var/lib/ceph/:/var/lib/ceph/ -v /etc/ceph:/etc/ceph -e CEPHFS_CREATE=1 -e CEPHFS_DATA_POOL_PG=128 -e CEPHFS_METADATA_POOL_PG=128 ceph/daemon:latest-nautilus mds

At this point, the max_mds value can be reset if it was previously anything other than 1.
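
For example, if you were running 2 active MDS daemons before the upgrade:

ceph fs set cephfs max_mds 2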

And now we can check ceph mds versions shows our updated MDSs:

{
    "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 3
}

Now for some post-upgrade housekeeping, on any node:

ceph osd require-osd-release nautilus
ceph osd unset noout
ceph mon enable-msgr2

We should also now update our config files and local version of ceph.

Firstly, let’s import our current config files into the cluster configuration DB - run this on all nodes:

ceph config assimilate-conf -i /etc/ceph/ceph.conf

Then we can upgrade the local ceph tools:

rpm -e ceph-release; rpm -Uvh https://download.ceph.com/rpm-nautilus/el7/noarch/ceph-release-1-1.el7.noarch.rpm
yum clean all; yum update ceph
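
We can sanity-check that the local tools are now on nautilus with ceph --version.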

And update our local config to the minimal config:

cp -f /etc/ceph/ceph.conf /etc/ceph/ceph.conf.old
ceph config generate-minimal-conf > /etc/ceph/ceph.conf.new
mv -f /etc/ceph/ceph.conf.new /etc/ceph/ceph.conf

We should also update our fstab entry to include multiple servers, not just the current one, so that we can actually mount properly on startup (this should have been done in the original guide - I learned afterwards!):

export CEPHMON=`ceph mon dump 2>&1 | grep "] mon." | awk '{print $3}' | sed -r 's/mon.(.*)/\1:6789/g'`
sed -ri "s/.*(:\/\s+\/var\/data\/.*)/$(echo ${CEPHMON} | sed 's/ /,/g')\1/" /etc/fstab

This may also now be a good time for other OS updates and a reboot if required. (Run ceph osd set noout first to stop ceph rebalancing when the node goes down, check ceph status to see if the current node is the active MDS and fail it if it is with ceph mds fail $(hostname -s), and then ceph osd unset noout when we’re done.)

Before rebooting we will want to drain the node of active containers:

docker node update --availability drain `hostname -f`

and then undrain it when we’re done:

docker node update --availability active `hostname -f`

And that’s it! Overall a pretty painless upgrade process, which is nice.

Fun with TOTP Codes

This all started with a comment I overheard at work from a colleague talking about a 2FA implementation on a service they were using.

“It works fine on everything except Google Authenticator on iPhone.”

… What? This comment alone immediately piqued my interest, I stopped what I was doing, turned round, and asked him to explain.

He explained that a service he was using provided 2FA support using TOTP codes. As is normal, they provided a QR Code, you scanned it with your TOTP application (Google Authenticator or Authy or similar), then you typed in the verification code - and it worked for both Google Authenticator and Authy on his Android phone, but only with Authy and not Google Authenticator on another colleague’s iPhone.

This totally nerd sniped me, and I just had to take a look.

The first thing I tried was to look at some “known-good” codes. I support RFC 6238 TOTP for MyDNSHost so I started there, and looked to generate a new code on a test account. Alas, in the dev install I was using, I had broken TOTP 2FA codes so couldn’t use it to test, so I Googled for a site to generate the images for me, and came across: https://stefansundin.github.io/2fa-qr/

I generated a Test QR Code, scanned it into Authy on my Android phone, and Google Authenticator on my colleague’s iPhone - and they both agreed on the code, and the next one, and so on.

We then copied the code from the service we were using, pasted that into the generator and scanned the new QR code in… and it also worked fine. Interesting.

So, the next thing to do was to compare the difference between the URLs. QR Codes for TOTP are actually just text that looks somewhat like: otpauth://totp/TestService?secret=TESTTEST (Key URI Format)

So looking at the 2 QR Codes:

  • Generated QR Code: otpauth://totp/TestService?secret=LJZC6S3XHFHHMMDXNBJC4LDBJYZCMU35 [1]
  • Service QR Code: otpauth://totp/TestService?secret=LJZC6S3XHFHHMMDXNBJC4LDBJYZCMU35&algorithm=SHA512 [1]

Interesting! The service was doing something different - it seemed to be suggesting that a different algorithm should be used. This was not something I was aware of, so I then looked at RFC 6238 to see what it had to say about the algorithms. It states:

TOTP implementations MAY use HMAC-SHA-256 or HMAC-SHA-512 functions, based on SHA-256 or SHA-512 [SHA2] hash functions, instead of the HMAC-SHA-1 function that has been specified for the HOTP computation in [RFC4226].

So this was valid after all… Was the iPhone doing something wrong? I couldn’t find any bug reports suggesting as much from some cursory googling.

Looking back at the web-based generator website, it has an “advanced options” field which lets us change the algorithm in the generated code, so I made some test QR Codes, all with the same secret, but one of each algorithm (SHA1, SHA256, SHA512).

I then imported all 3 into Google Authenticator on both Android and a spare iPhone and took a look at the output:

Phones showing TOTP Codes

Ah… no, it does not look like it’s the iPhone at fault here. In fact it very much appears to be the opposite [2]: it appears that the Google Authenticator app on iPhone is the only one that correctly cares about the algorithm provided. Google Authenticator on Android, and Authy on either Android or iPhone, all appear to just ignore the Algorithm param and default to SHA1.

It also looks like the service that was providing these codes was not validating them correctly, and was expecting the SHA1 code despite asking for SHA512.

This looked like the end of it, but I wanted to be sure. I decided to throw together a quick PHP script to test the theory. I normally use PHPGangsta/GoogleAuthenticator for my GoogleAuthenticator validation, so I set about modifying that to support the different algorithms (modified code is available here), and then produced this test script [3]:

<?php
	require_once(__DIR__ . '/PHPGangsta-GoogleAuthenticator/PHPGangsta/GoogleAuthenticator.php');

	$ga = new PHPGangsta_GoogleAuthenticator();
	$ga->setCodeLength(6);

	$secret = 'LJZC6S3XHFHHMMDXNBJC4LDBJYZCMU35';
	$time = floor(time() / 30);
	$time = '51793295'; // Comment this out for real-time codes.

	echo 'Time: ', $time, "\n\n";
	echo 'Code SHA1: ', $ga->getCode($secret, $time, 'SHA1'), "\n";
	echo 'Code SHA256: ', $ga->getCode($secret, $time, 'SHA256'), "\n";
	echo 'Code SHA512: ', $ga->getCode($secret, $time, 'SHA512'), "\n";
	echo "\n";

I ran the script, and compared its output to the phones - the script agreed with the iPhone:

$ php test.php
Time: 51793295

Code SHA1: 583328
Code SHA256: 972899
Code SHA512: 911582

$

I’ve also created a demo page here that displays 3 QR codes (one for each algorithm, all with the same secret) and their expected output, to allow people to reproduce this on their own devices.

So that’s that [4]. Looks like the reason it works on everything except Google Authenticator on iPhone… is because everything else is wrong.

Update 1: Looks like there is a bug report for Google Authenticator on Android for this here


  1. This TOTP code is not actually used live anywhere, and is for demonstration purposes only. ↩︎

  2. This was painful for me to admit out loud to my colleague… ↩︎

  3. For demonstration purposes this script uses a fixed timeslice to match up with the earlier picture. ↩︎

  4. Yes, I have been in contact with the service in question to point out the problem to them. ↩︎

Docker Swarm with Ceph for cross-server files

I’ve been wanting to play with Docker Swarm for a while now for hosting containers, and finally sat down this weekend to do it.

Something that has always stopped me before now was that I wanted to have some kind of cross-site storage, but I don’t have any kind of SAN storage available to me - just standalone hosts. I’ve been able to work around this using ceph on the nodes.

Note: I’ve never used ceph before, I don’t really know what I’m doing with ceph, so this is all a bit of guesswork. I used Funky Penguin’s Geek Cookbook as a basis for some of this, though some things have changed since then, and I’m using base CentOS not Atomic Host (I tried Atomic Host, but wanted a newer version of docker so switched away).

All my physical servers run Proxmox, and this is no exception. On 3 of these host nodes I created a new VM (1 per node) to be part of the cluster. These all have 3 disks: 1 for the base OS, 1 for Ceph, and 1 for cloud-init (the non-cloud-init disks are all SCSI with individual iothreads).

CentOS provide a cloud-image compatible disk here that I use as the base OS. I created a disk in Proxmox, then detached it, overwrote it with the CentOS-provided image and re-attached it. I could have used an Ubuntu cloud image instead.
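
Proxmox also has qm importdisk, which can achieve much the same thing without the detach/overwrite dance - a rough sketch, assuming VM ID 101, a storage pool called local-lvm and the image already downloaded locally (not what I actually ran):

qm importdisk 101 CentOS-7-x86_64-GenericCloud.qcow2 local-lvm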

I now had 3 empty CentOS VMs ready to go.

First thing to do is get the nodes ready for docker:

curl https://download.docker.com/linux/centos/docker-ce.repo -o /etc/yum.repos.d/docker-ce.repo
mkdir /etc/docker
echo '{"storage-driver": "overlay2"}' > /etc/docker/daemon.json
yum install docker-ce
systemctl start chronyd
systemctl enable chronyd
systemctl start docker
systemctl enable docker

And build our swarm cluster.

On the first node:

docker swarm init
docker swarm join-token manager

And then on the other 2 nodes, copy and paste the output from the last command to join to the cluster. This joins all 3 nodes as managers, and you can confirm the cluster is working like so:

[root@ds-2 ~]# docker node ls
ID                            HOSTNAME                         STATUS              AVAILABILITY        MANAGER STATUS      ENGINE VERSION
fo6paibeunoo9sulaiqu3iuqu     ds-1.dev.shanemcc.net            Ready               Active              Leader              18.09.1
phoy6ju7ait1aew7yifiemaob *   ds-2.dev.shanemcc.net            Ready               Active              Reachable           18.09.1
eexahtaiza1saibeishu8quie     ds-3.dev.shanemcc.net            Ready               Active              Reachable           18.09.1
[root@ds-2 ~]#

Even though we will be running ceph within docker containers, I’ve also installed the ceph tools on the host node for convenience:

rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
rpm -Uvh https://download.ceph.com/rpm-luminous/el7/noarch/ceph-release-1-1.el7.noarch.rpm
yum install ceph

And all 3 host nodes have SSH keys generated (ssh-keygen -t ed25519) and set up within /root/.ssh/authorized_keys on each node so that I can ssh between them.

Now we can start setting up ceph.

Remove any old ceph that may be lying around:

rm -Rfv /etc/ceph
rm -Rfv /var/lib/ceph
mkdir /etc/ceph
mkdir /var/lib/ceph
chcon -Rt svirt_sandbox_file_t /etc/ceph
chcon -Rt svirt_sandbox_file_t /var/lib/ceph

On the first node, initialise a ceph monitor:

docker run -d --net=host --restart always -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ \
-e MON_IP=$(ip addr show dev eth0 | grep "inet " | head -n 1 | awk '{print $2}' | awk -F/ '{print $1}') \
-e CEPH_PUBLIC_NETWORK=$(ip route show dev eth0 | grep link | grep -v 169.254.0.0 | awk '{print $1}') \
--name="ceph-mon" ceph/daemon mon

And then copy the generated data over to the other 2 nodes:

scp -r /etc/ceph/* ds-2:/etc/ceph/
scp -r /etc/ceph/* ds-3:/etc/ceph/

And start the monitor on those also using the same command again.

Now, on all 3 nodes we can start a manager:

docker run -d --net=host --privileged=true --pid=host -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ --name="ceph-mgr" --restart=always ceph/daemon mgr

And create the OSDs on all 3 nodes (This will remove all the data from the disk provided (/dev/sdb) so be careful. The disk is given twice here):

ceph auth get client.bootstrap-osd -o /var/lib/ceph/bootstrap-osd/ceph.keyring
docker run --rm --privileged=true -v /dev/:/dev/ -e OSD_DEVICE=/dev/sdb ceph/daemon zap_device
docker run -d --net=host --privileged=true --pid=host -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ -v /dev/:/dev/ -e OSD_DEVICE=/dev/sdb -e OSD_TYPE=disk --name="ceph-osd" --restart=always ceph/daemon osd

Once the OSDs are finished initialising on each node (watch docker logs -f ceph-osd), we can create the MDSs on each node:

docker run -d --net=host --name ceph-mds --restart always -v /var/lib/ceph/:/var/lib/ceph/ -v /etc/ceph:/etc/ceph -e CEPHFS_CREATE=1 -e CEPHFS_DATA_POOL_PG=128 -e CEPHFS_METADATA_POOL_PG=128 ceph/daemon mds

And then once these are created, let’s tell ceph how many copies of things to keep:

ceph osd pool set cephfs_data size 2
ceph osd pool set cephfs_metadata size 2

And there’s no point scrubbing on VM disks:

ceph osd set noscrub
ceph osd set nodeep-scrub

Now, we have a 3-node ceph cluster set up and we can mount it into the hosts. Each host will mount from itself:

mkdir /var/data
ceph auth get-or-create client.dockerswarm osd 'allow rw' mon 'allow r' mds 'allow' > /etc/ceph/keyring.dockerswarm
echo "$(hostname -s):6789:/      /var/data/      ceph      name=dockerswarm,secret=$(ceph-authtool /etc/ceph/keyring.dockerswarm -p -n client.dockerswarm),noatime,_netdev,context=system_u:object_r:svirt_sandbox_file_t:s0 0 2" >> /etc/fstab
mount -a

All 3 hosts should now have a /var/data directory and files that are created on one should appear automatically on the others.
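
A quick way to convince yourself it’s working (node names as per the docker node ls output above):

touch /var/data/hello-from-$(hostname -s)
ssh ds-2 ls -l /var/data/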

For my use-case so far, this is sufficient. I’m using files/directories within /var/data as bind mounts (not volumes) in my docker containers currently and it seems to be working. I’m planning on playing about more with this in the coming weeks to see how well it works with more real-world usage.
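
For illustration, using one of these bind mounts in a swarm service looks something like this (image, names and paths here are purely illustrative):

mkdir -p /var/data/web
docker service create --name web --replicas 2 --publish 8080:80 \
  --mount type=bind,source=/var/data/web,target=/usr/share/nginx/html \
  nginx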

Advent of Code Benchmarking

For a few years now I’ve been enjoying Eric Wastl‘s Advent of Code. For those unaware, each year since 2015 Advent of Code provides a 2-part coding challenge every day from December 1st to December 25th.

In previous years, Myself and Chris have been fairly informally trying to see who was able to produce the fastest code (Me in PHP, Chris in Python). In the final week of last year to assist with this, we both made our repos run in Docker and produce time output for each day.

This allowed us to run each other’s code locally to compare fairly without needing to install the other’s dev environment, and made the testing a bit fairer as it was no longer dependant on who had the faster CPU when running their own solution. For the rest of the year this was fine and we carried on as normal. As we got to the end I remarked it would be fun to have a web interface that automatically dealt with it and showed us the scores, but there was obviously no point in doing that once the year was over. Maybe in a future year…

Fast forward to this year. Myself and Chris (and ChrisN) coded up our Day 1 solutions as normal and then some other friends started doing it for the first time. I remembered my plans from the previous year and suggested everyone should also docker-ify their repos… and so they agreed.

Now, I’m not one who is lacking in side-projects, but with everyone making their code able to run with a reasonably-similar docker interface, and the first couple of days not yet fully scratching the coding-itch, I set about writing what I now call AoCBench.

The idea was simple:

  • Check out (or update) code
  • Build docker container
  • Run each day multiple times and store time output
  • Show fastest time for each person/day in a table.

And the initial version did exactly that. So I fired up an LXC container on one of my servers and set it off to start running benchmarks and things were good.

AoCBench Main Page

Pretty quickly the first problem became obvious - it was running everything every time, which, as I added more people, really slowed things down. So the next stage was to make it only run when code changed.

In the initial version, the fastest time from 10 runs was the time that was used for the benchmark. But some solutions had wildly-varying times and sometimes “got lucky” with a fast run which unfairly skewed the results. We tried using mean times. Then we tried running the benchmarks more often to see if this resulted in more-alike times. I even tried making it ignore the top-5 slowest times and then taking the mean of the rest. These still didn’t really result in a fair result as there was still a lot of variance. Eventually we all agreed that the median time was probably the fairest given the variance in some of the solutions.

But this irked me somewhat, there was no obvious reason some of the solutions should be so variant.

It seemed like it was mostly the PHP solutions that had the variance; even after switching my container to alpine (which did result in quite a speed improvement over the non-alpine one) I was still seeing variance.

I was beginning to wonder if the host node was too busy. It didn’t look too busy, but it seemed like the only explanation. Moving the benchmarking container to a different host node (that was otherwise empty) seemed to confirm this somewhat. After doing that (and moving it back) I looked some more at the host node. I found an errant fail2ban process sitting using 200% CPU, and killing this did make some improvement (Though the node has 24 cores, so this shouldn’t really have mattered too much. If it wasn’t for AoCBench I wouldn’t even have noticed that!). But the variance remained, so I just let it be. Somewhat irked, but oh well.

AoCBench Matrix Page

We spent the next few evenings all optimising our solutions some more, vying for the fastest code. To level the playing field some more, I even started feeding everyone the same input, to counter the fact that some inputs were just fundamentally quicker than others. After ensuring that everyone was using the same input, the next step was to ensure that everyone gave the right answer, removing them from the table if they didn’t (this caught out a few “optimisations” that optimised away the right answer by mistake!). I also added support for running each solution against everyone else’s input files and displaying this in a grid, to ensure that everyone’s code worked for all inputs, not just their own (or the normalised input that was being fed to them all).

After all this, the variance problem was still nagging away. One day in particular resulted in huge variances in some solutions (from less than 1s up to more than 15s sometimes). Something wasn’t right.

I’d already ruled out CPU usage from being at fault because the CPU just wasn’t being taxed. I’d added a sleep-delay between each run of the code in case the host node scheduler was penalising us for using a lot of CPU in quick succession. I’d even tried running all the containers from a tmpfs RAM disk in case the delay was being caused by reading in the input data, but nothing seemed to help.

With my own solution, I was able to reproduce the variance on my own local machine, so it wasn’t just the chosen host node at fault. But why did it work so much better with no variance on the idle host node? And what made the code for this day so much worse than the surrounding days?

I began to wonder if it was memory related. Neither the host node nor my local machine was particularly starved for memory, but I’d ruled out CPU and disk I/O at this stage. I changed my code for Day 3 to use SplFixedArray and pre-allocated the whole array at start up before then interacting with it. And suddenly the variance was all but gone. The new solution was slow-as-heck comparatively, but there was no more variance!

So now that I knew what the problem was (Presumably the memory on the busy host node and my local machine is quite fragmented) I wondered how to fix it. Pre-allocating memory in code wasn’t an option with PHP so I couldn’t do that, and I also couldn’t pre-reserve a block of memory within each Docker container before running the solutions. But I could change the benchmarking container from running as an LXC Container to a full KVM VM. That would give me a reserved block of memory that wasn’t being fragmented by the host node and the other containers. Would this solve the problem?

Yes. It did. The extreme-variance went away entirely, without needing any changes to any code. I re-ran all the benchmarks for every person on every day and the levels of variance were within acceptable range for each one.

AoCBench Podium Mode

The next major change came about after Chris got annoyed by python (even under pypy) being unable to compete with the speed improvements that PHP7 has made, and switched to using Nim. Suddenly most of the competition was gone. The compiled code wins every time. every. time. (Obviously). So Podium Mode was added to allow for competing for the top 3 spaces on each day.

Finally, after a lot of confusion around implementations for Day 7 and how some inputs behaved differently than others in different ways in different code, the input matrix code was extended to allow feeding custom inputs to solutions, to weed out mis-assumptions and see how they respond to input that isn’t quite so carefully crafted.

If anyone wants to follow along, I have AoCBench running here - and I have also documented here the requirements for making a repo AoCBench compatible. The code for AoCBench is fully open source under the MIT License and available on GitHub.

Happy Advent of Code all!

mdadm RAID with Proxmox

I recently acquired a new server with 2 drives that I intended to use as RAID1 for a virtualisation host for various things.

My hypervisor of choice is Proxmox (for a few reasons - support for KVM and LXC primarily, but the fact it’s Debian-based is a nice bonus, and I really dislike the occasionally-braindead networking implementation from VMware, which rules out ESXi).

This particular server does not have a RAID card, so I needed to use a software RAID implementation. Out of the box for RAID1 on Proxmox you need to use ZFS, however to keep this box similar to others I have, I wanted to use ext4 and mdadm. So we’re going to have to do a bit of manual poking to get this how we need it.

This post is mostly an aide-memoire for myself for the future.

Install Proxmox

So, the first thing to do is get a fresh Proxmox install - I’m using 5.2-1 at the time of writing.

After the install is done, we should have 1 drive with a proxmox install, and 1 unused disk.

The installer will create a proxmox default layout that looks something like this (I’m using 1TB Drives):

Device      Start        End    Sectors   Size Type
/dev/sda1    2048       4095       2048     1M BIOS boot
/dev/sda2    4096     528383     524288   256M EFI System
/dev/sda3  528384 1953525134 1952996751 931.3G Linux LVM

This looks good, so now we can begin moving this to a RAID array.

Clone partition table from first drive to second drive.

In my examples, sda is the drive that we installed proxmox to, and sdb is the drive I want to use as a mirror.

To start with, let’s clone the partition table from sda to sdb, which is really easy on Linux using sfdisk:

root@tirant:~# sfdisk -d /dev/sda | sfdisk /dev/sdb
Checking that no-one is using this disk right now ... OK

Disk /dev/sdb: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xa0492137

Old situation:

>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Created a new GPT disklabel (GUID: 7755C404-FEA5-004A-998C-F85E217AE7B7).
/dev/sdb1: Created a new partition 1 of type 'BIOS boot' and of size 1 MiB.
/dev/sdb2: Created a new partition 2 of type 'EFI System' and of size 256 MiB.
/dev/sdb3: Created a new partition 3 of type 'Linux LVM' and of size 931.3 GiB.
/dev/sdb4: Done.

New situation:

Device      Start        End    Sectors   Size Type
/dev/sdb1    2048       4095       2048     1M BIOS boot
/dev/sdb2    4096     528383     524288   256M EFI System
/dev/sdb3  528384 1953525134 1952996751 931.3G Linux LVM

The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.
root@tirant:~#

sdb now has the same partition table as sda. However, we’re converting this to a RAID1, so we’ll want to change the partition type of the data partition, which we can also do easily with sfdisk:

root@tirant:~# sfdisk --part-type /dev/sdb 3 A19D880F-05FC-4D3B-A006-743F0F84911E

The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.
root@tirant:~#

(for MBR, you would use something like: sfdisk --part-type /dev/sdb 3 fd)

Set up mdadm

So now we need to set up a RAID1. mdadm isn’t installed by default, so we’ll need to install it using apt-get install mdadm (you may need to run apt-get update first).

Once mdadm is installed, let’s create the RAID1 (we’ll create an array with a “missing” disk to start with, and add the first disk into the array in due course):

root@tirant:~# mdadm --create /dev/md0 --level=1 --raid-disks=2 missing /dev/sdb3
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
Continue creating array?
Continue creating array? (y/n) y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
root@tirant:~#

And now check that we have a working one-disk array:

root@tirant:~# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb3[1]
      976367296 blocks super 1.2 [2/1] [_U]
      bitmap: 8/8 pages [32KB], 65536KB chunk

unused devices: <none>
root@tirant:~#

Fantastic.

Move proxmox to the new array

Because Proxmox uses LVM, this next step is quite straightforward.

Firstly, let’s turn this new RAID array into an LVM PV:

root@tirant:~# pvcreate /dev/md0
  Physical volume "/dev/md0" successfully created.
root@tirant:~#

And add it into the pve vg:

root@tirant:~# vgextend pve /dev/md0
  Volume group "pve" successfully extended
root@tirant:~#

Now we can move the proxmox install over to the new array using pvmove:

root@tirant:~# pvmove /dev/sda3 /dev/md0
  /dev/sda3: Moved: 0.00%
  /dev/sda3: Moved: 0.19%
  ...
  /dev/sda3: Moved: 99.85%
  /dev/sda3: Moved: 99.95%
  /dev/sda3: Moved: 100.00%
root@tirant:~#

(This will take some time depending on the size of your disks)

Once this is done, we can remove the non-raid disk from the vg:

root@tirant:~# vgreduce pve /dev/sda3
  Removed "/dev/sda3" from volume group "pve"
root@tirant:~#

And remove LVM from it:

root@tirant:~# pvremove /dev/sda3
  Labels on physical volume "/dev/sda3" successfully wiped.
root@tirant:~#

Now we can add the original disk into the array as well.

We again change the partition type:

root@tirant:~# sfdisk --part-type /dev/sda 3 A19D880F-05FC-4D3B-A006-743F0F84911E

The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.
root@tirant:~#

and then add it into the array:

root@tirant:~# mdadm --add /dev/md0 /dev/sda3
mdadm: added /dev/sda3
root@tirant:~#

We can watch as the array is synced:

root@tirant:~# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda3[2] sdb3[1]
      976367296 blocks super 1.2 [2/1] [_U]
      [>....................]  recovery =  0.1% (1056640/976367296) finish=123.0min speed=132080K/sec
      bitmap: 8/8 pages [32KB], 65536KB chunk

unused devices: <none>
root@tirant:~#

We need to wait for this to complete before continuing.
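To keep an eye on the resync without re-running cat by hand, watch works nicely (the interval here is arbitrary):

watch -n 10 cat /proc/mdstat

Once the recovery finishes, mdstat shows both members active: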

root@tirant:~# cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sda3[2] sdb3[1]
      976367296 blocks super 1.2 [2/2] [UU]
      bitmap: 1/8 pages [4KB], 65536KB chunk

unused devices: <none>
root@tirant:~#

Making the system bootable

Now we need to ensure we can boot this new system!

Add the required mdadm config to mdadm.conf

root@tirant:~# mdadm --examine --scan >> /etc/mdadm/mdadm.conf
root@tirant:~#
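For reference, the line this appends will look something like the following (the UUID and name here are purely illustrative - yours will differ):

ARRAY /dev/md/0 metadata=1.2 UUID=00000000:11111111:22222222:33333333 name=tirant:0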

Add some required modules to grub:

echo '' >> /etc/default/grub
echo '# RAID' >> /etc/default/grub
echo 'GRUB_PRELOAD_MODULES="part_gpt mdraid09 mdraid1x lvm"' >> /etc/default/grub

and update grub and the kernel initramfs:

root@tirant:~# update-grub
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-4.15.17-1-pve
Found initrd image: /boot/initrd.img-4.15.17-1-pve
Found memtest86+ image: /boot/memtest86+.bin
Found memtest86+ multiboot image: /boot/memtest86+_multiboot.bin
done
root@tirant:~# update-initramfs -u
update-initramfs: Generating /boot/initrd.img-4.15.17-1-pve
root@tirant:~#

And actually install grub to the disk:

root@tirant:~# grub-install /dev/sda
Installing for i386-pc platform.
Installation finished. No error reported.
root@tirant:~#

If the server is booting via EFI, the output will be slightly different. We can also force it to install for the alternative platform using --target i386-pc or --target x86_64-efi, e.g.:

root@tirant:~# grub-install --target x86_64-efi --efi-directory /mnt/efi
Installing for x86_64-efi platform.
File descriptor 4 (/dev/sda2) leaked on vgs invocation. Parent PID 29184: grub-install
File descriptor 4 (/dev/sda2) leaked on vgs invocation. Parent PID 29184: grub-install
EFI variables are not supported on this system.
EFI variables are not supported on this system.
grub-install: error: efibootmgr failed to register the boot entry: No such file or directory.
root@tirant:~#

(/mnt/efi is /dev/sda2 mounted)
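If the EFI partition isn’t already mounted, mounting it first would look something like this (the mount point is arbitrary, it just needs to match --efi-directory):

root@tirant:~# mkdir -p /mnt/efi
root@tirant:~# mount /dev/sda2 /mnt/efi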

Now, clone the BIOS and EFI partitions from the old disk to the new one:

root@tirant:~# dd if=/dev/sda1 of=/dev/sdb1
2048+0 records in
2048+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0263653 s, 39.8 MB/s
root@tirant:~# dd if=/dev/sda2 of=/dev/sdb2
524288+0 records in
524288+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 5.48104 s, 49.0 MB/s
root@tirant:~#

Finally, reboot and test. If everything has worked, the server should boot up as normal.

DNS Hosting - Part 3: Putting it all together

Post thumbnail

In my previous posts I discussed the history leading up to, and the eventual rewrite of, my DNS hosting solution. So this post will (finally) talk briefly about how it all runs in production on MyDNSHost.

Shortly before the whole rewrite I’d found myself playing around a bit with Docker for another project, so I decided early on that I was going to make use of Docker for the main bulk of the setup to allow me to not need to worry about incompatibilities between different parts of the stack that needed different versions of things, and to update different bits at different times.

The system is split up into a number of containers (and could probably be split up into more).

To start with, I had the following containers:

  • API Container - Deals with all the backend interactions
  • WEB Container - Runs the main frontend that people see. Interacts with the API Container to actually do anything.
  • DB Container - Holds all the data used by the API
  • BIND Container - Runs an instance of bind to handle DNSSEC signing and distributing DNS Zones to the public-facing servers.
  • CRON Container - This container runs a bunch of maintenance scripts to keep things tidy and initiate DNSSEC signing etc.

The tasks in the CRON container could probably be split up more, but for now I’m ok with having them in 1 container.

This worked well; however, I found that redeploying the API or WEB containers caused me to be logged out from the frontend, so another container was soon added:

  • MEMCACHED Container - Stores session data from the API and FRONTEND containers to allow for horizontal scaling and restarting of containers.

In the first instance, the API Container was also responsible for interactions with the BIND container. It would generate zone files on-demand when users made changes, and then poke BIND to load them. However this was eventually split out further, and another 3 containers were added:

  • GEARMAN Container - Runs an instance of Gearman for the API container to push jobs to.
  • REDIS Container - Holds the job data for GEARMAN.
  • WORKER Container - Runs a bunch of worker scripts to do the tasks the API Container previously did for generating/updating zone files and pushing to BIND.

Splitting these tasks out into the WORKER container made the frontend feel faster as it no longer needed to wait for things to happen and could just fire the jobs off into GEARMAN and let it worry about them. I also get some extra logging from this as the scripts can be a lot more verbose. In addition, if a worker can’t handle a job it can be rescheduled to try again and the workers can (in theory) be scaled out horizontally a bit more if needed.

There were some initial challenges with this - the main one being around how the database interaction worked, as the workers would fail after periods of inactivity, then get auto-restarted and immediately work again. This turned out to be mainly due to how I’d pulled the code out of the API and into the workers. Whereas the API scripts run using the traditional method where the script gets called, does its thing (including setup), then dies, the WORKER scripts were long-lived processes, so the DB connections were eventually timing out and the code was not designed to handle this.

Finally, more recently I added statistical information about domains and servers, which required another 2 containers:

  • INFLUXDB Container - Runs InfluxDB to store time-series data and provide a nice way to query it for graphing.
  • CHRONOGRAF Container - Runs Chronograf to allow me to easily pull out data from INFLUXDB for testing.

That’s quite a few containers to manage. To actually manage running them, I primarily make use of Docker-Compose (to set up the various networks, volumes, containers, etc.). This works well for the most part, but there are a few limitations around how it deals with restarting containers that cause fairly substantial downtime when upgrading WEB or API. To get around this I wrote a small bit of orchestration scripting that uses docker-compose to scale the WEB and API containers up to 2 (letting docker-compose do the actual creation of the new container), then manually kills off the older container, and then scales them back down to 1. This seems to behave well.
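As a very rough illustration of what that orchestration does, a simplified sketch might look something like the following (the service name, and the reliance on docker ps listing newest containers first, are my assumptions - this isn’t the actual script):

#!/bin/bash
# Simplified redeploy sketch: run two copies of a service, kill the old one, scale back down.
SERVICE="web"

# Pull the new image and start a second copy of the service alongside the existing one.
docker-compose pull "${SERVICE}"
docker-compose up -d --no-recreate --scale "${SERVICE}=2" "${SERVICE}"

# docker ps lists newest containers first, so the last entry is the old container.
OLD=$(docker ps -q --filter "label=com.docker.compose.service=${SERVICE}" | tail -n 1)
docker rm -f "${OLD}"

# Scale back down to a single (new) instance.
docker-compose up -d --scale "${SERVICE}=1" "${SERVICE}"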

So with all these containers hanging around, I needed a way to deal with exposing them to the web, and automating the process of ensuring they had SSL Certificates (using Let’s Encrypt). Fortunately, Chris Smith has already solved this problem for the most part, in a way that worked for what I needed. In a blog post he describes a set of docker containers he created that automatically run nginx to proxy to other internal containers and obtain appropriate SSL certificates using DNS challenges. For the most part, all that was required was running this and adding some labels to my existing containers, and that was that…

Except this didn’t quite work initially, as I couldn’t do the required DNS challenges unless I hosted my DNS somewhere else. So I ended up adding support for HTTP Challenges, after which I was able to use this without needing to host DNS elsewhere. (And in return, Chris has added support for using MyDNSHost for the DNS Challenges, so it’s a win-win.) My orchestration script also handles setting up and running the automatic nginx proxy containers.

This brings me to the public-facing DNS servers. These are currently the only bit not running in Docker (though they could). They run on some standard Ubuntu 16.04 VMs with a small setup script that installs BIND and an extra service to handle automatically adding/removing zones based on a special “catalog zone”, since the versions of BIND currently in use don’t yet support catalog zones natively. Zones are transferred between the frontend and the public servers using standard DNS NOTIFY and AXFR. DNSSEC is handled by the backend server pre-signing the zones before sending them to the public servers, which never see the signing keys.

By splitting jobs up this way, in theory it should be possible in future (if needed) to move away from BIND to an alternative (such as PowerDNS).

As well as the public service that I’m running, all of the code involved (all the containers and all the orchestration) is available on GitHub under the MIT License. Documentation is a little light (read: pretty non-existent), but it’s all there for anyone else to use/improve/etc.

DNS Hosting - Part 2: The rewrite

Post thumbnail

In my previous post about DNS Hosting I discussed the history leading up to when I decided I needed a better personal DNS hosting solution. I decided to code one myself to replace what I had been using previously.

I decided there were a few things that were needed:

  • Fully-Featured API
    • I wanted full control over the zone data programmatically, everything should be possible via the API.
    • The API should be fully documented.
  • Fully-Featured default web interface.
    • There should be a web interface that fully implements the API. Just because there is an API shouldn’t mean it has to be used to get full functionality.
    • There should exist nothing that only the default web ui can do that can’t be done via the API as well.
  • Multi-User support
    • I also host some DNS for people who aren’t me, they should be able to manage their own DNS.
  • Domains should be shareable between users
    • Every user should be able to have their own account
    • User accounts should be able to be granted access to domains that they need to be able to access
      • Different users should have different access levels:
        • Some just need to see the zone data
        • Some need to be able to edit it
        • Some need to be able to grant other users access
  • Backend Agnostic
    • The authoritative data for the zones should be stored independently from the software used to serve it to allow changing it easily in future

These were the basic criteria and what I started off with when I designed MyDNSHost.

MyDNSHost Homepage

Now that I had the basic criteria, I started off by coming up with a basic database structure for storing the data that I thought would suit my plans, and a basic framework for the API backend so that I could start creating some initial API endpoints. With this in place I was able to create the database structure, and pre-seed it with some test data. This would allow me to test the API as I created it.

I use Chrome, so for testing the API I use the Restlet Client extension.

Armed with a database structure, a basic API framework, and some test data - I was ready to code!

Except I wasn’t.

Before I could start properly coding the API I needed to think of what endpoints I wanted, and how the interactions would work. I wanted the API to make sense, so wanted to get this all planned first so that I knew what I was aiming for.

I decided pretty early on that I was going to version the API - that way, if I messed it all up I could redo it and not need to worry about backwards compatibility, so for the time being everything would exist under the /1.0/ directory. I came up with the following basic idea for endpoints:

MyDNSHost LoggedIn Homepage
  • Domains
    • GET /domains - List domains the current user has access to
    • GET /domains/<domain> - Get information about <domain>
    • POST /domains/<domain> - Update <domain>
    • DELETE /domains/<domain> - Delete <domain>
    • GET /domains/<domain>/records - Get records for <domain>
    • POST /domains/<domain>/records - Update records for <domain>
    • DELETE /domains/<domain>/records - Delete records for <domain>
    • GET /domains/<domain>/records/<recordid> - Get specific record for <domain>
    • POST /domains/<domain>/records/<recordid> - Update specific record for <domain>
    • DELETE /domains/<domain>/records/<recordid> - Delete specific record for <domain>
  • Users
    • GET /users - Get a list of users (non-admin users should only see themselves)
    • GET /users/(<userid>|self) - Get information about a specific user (or the current user)
    • POST /users/(<userid>|self) - Update information about a specific user (or the current user)
    • DELETE /users/<userid> - Delete a specific user (or the current user)
    • GET /users/(<userid>|self)/domains - Get a list of domains for the given user (or the current user)
  • General
    • GET /ping - Check that the API is responding
    • GET /version - Get version info for the API
    • GET /userdata - Get information about the current login (user, access-level, etc)

This looked sane so I set about with the actual coding!

Rather than messing around with OAuth tokens and the like, I decided that every request to the API should be authenticated - initially using basic-auth with username/password, but eventually also using API Keys. This made things fairly simple whilst testing, and made interacting with the API via scripts quite straightforward (no need to grab a token first and then do things).
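As an example of how simple this makes scripted interaction, listing the domains the current user has access to is just a single authenticated request (the hostname and API-key header names below are illustrative placeholders, not necessarily what MyDNSHost actually uses):

# Using basic-auth:
curl -u 'user@example.org:password' https://dns.example.com/1.0/domains

# Or using an API Key (header names assumed for illustration):
curl -H 'X-API-User: user@example.org' -H 'X-API-Key: 0000-1111-2222-3333' https://dns.example.com/1.0/domains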

The initial implementation of the API with domain/user editing functionality and API Key support was completed within a day, and then followed a week of evenings tweaking and adding functionality that would be needed later - such as internal “hook” points for when certain actions happened (changing records etc) so that I could add code to actually push these changes to a DNS Server. As I was developing the API, I also made sure to document it using API Blueprint and Aglio - it was easier to keep it up to date as I went, than to write it all after-the-fact.

Once I was happy with the basic API functionality and knew from my (manual) testing that it functioned as desired, I set about on the Web UI. I knew I was going to use Bootstrap for this because I am very much not a UI person and bootstrap helps make my stuff look less awful.

MyDNSHost Records View

Now, I should point out here, I’m not a developer for my day job, most of what I write I write for myself to “scratch an itch” so to speak. I don’t keep up with all the latest frameworks and best practices and all that. I only recently in the last year switched away from hand-managing project dependencies in Java to using gradle and letting it do it for me.

So for the web UI I decided to experiment and try to do things “properly”. I decided to use composer for dependency management for the first time, used a 3rd-party request router (Bramus/Router) for handling how pages are loaded, and used Twig for templating. (At this point, the API code was all hand-coded with no 3rd-party dependencies. However, my experiment with the front end was successful, and the API code has since changed to also make use of composer and some 3rd-party dependencies for some functionality.)

The UI was much quicker to get to an initial usable state - as all the heavy lifting was already handled by the backend API code, the UI just had to display this nicely.

I then spent a few more evenings and weekends fleshing things out a bit more, and adding in things that I’d missed in my initial design and implementations. I also wrote some of the internal “hooks” that were needed to make the API able to interact with BIND and PowerDNS for actually serving DNS Data.

As this went on, whilst the API layout I’d planned stayed mostly static (aside from a bunch more routes being added), I did end up revisiting some of my initial decisions:

  • I moved from a level-based access system for separating users and admins to an entirely role-based system.
    • Different users can be granted access to do different things (eg manage users, impersonate users, manage all domains, create domains, etc)
  • I made domains entirely user-agnostic
    • Initially each domain had an “owner” user, but this was changed so that ownership over a domain is treated the same as any other level of access on the domain.
    • This means that domains can technically be owned by multiple people (Though in normal practice an “owner” can’t add another user as an “owner” - only users with “Manage all domains” permission can add users at the “owner” level)
    • This also allows domain-level API Keys that can be used to only make changes to a certain domain not all domains a user has access to.

Eventually I had a UI and API system that seemed to do what I needed and I could look at actually putting this live and starting to use it (which I’ll talk about in the next post).

After the system went live I also added a few more features that were requested by users that weren’t part of my initial requirements, such as:

  • TOTP 2FA Support
    • With a “remember this device” option, rather than needing to enter a code every time you log in.
  • DNSSEC Support
  • EMAIL Notifications when certain important actions occur
    • User API Key added/changed
    • User 2FA Key added/changed
  • WebHooks whenever zone data is changed
  • Ability to use your own servers for hosting the zone data not mine
    • The live system automatically allows AXFR for a zone from any server listed as an NS on the domain and sends appropriate notifies.
  • Domain Statistics (such as queries per server, per record type, per domain etc)
  • IDN Support
  • Ability to import and export raw BIND data.
    • This makes it easier for people to move to/from the system without needing any interaction with admin users or needing to write any code to deal with zone files.
    • Ability to import Cloudflare-style zone exports.
      • These look like BIND zone files, but are slightly invalid, this lets users just import from cloudflare without needing to manually fix up the zones.
  • Support for “Exotic” record types: CAA, SSHFP, TLSA etc.
  • Don’t allow domains to be added to accounts if they are sub-domains of an already-known-about domain.
    • As a result of this, also don’t allow people to add obviously-invalid domains or whole TLDs etc.

HUGO PPA

Post thumbnail

I run Ubuntu on my servers, and since moving to Hugo, I wanted to make sure I was using the latest version available.

The Ubuntu repos currently contain Hugo version 0.15 in Xenial, and 0.25.1 in Artful (and the next release, Bionic, only contains 0.26). The latest version of Hugo (as of today) is 0.32.2 - so the main repos are quite a bit out of date.

So to work around this, I’ve set up an apt repo that tracks the latest Hugo release, which can be installed and used like so:

sudo wget http://packages.dataforce.org.uk/packages.dataforce.org.uk_hugo.list -O /etc/apt/sources.list.d/packages.dataforce.org.uk_hugo.list
wget -qO- http://packages.dataforce.org.uk/pubkey.gpg | sudo apt-key add -
sudo apt-get update
sudo apt-get install hugo

This repo tracks the latest Hugo debs in all 4 of the supported architectures (amd64, i386, armhf and arm64) and should stay automatically up to date with the latest version.
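Once installed, it’s easy to confirm that apt is pulling Hugo from the new repo rather than the Ubuntu archive, and to check which version you’ve ended up with:

apt-cache policy hugo
hugo version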