Friday, March 03, 2017

My Quest to Demonstrate EVPN Multihoming to the Server in a Virtualized Data Center Topology

I recently had a customer ask if I could help him develop a proof of concept to illustrate how EVPN multihoming could be used in a data center environment to replace technologies like Link Aggregation Groups (LAG) and Multi-Chassis LAG (MC-LAG) to support connections to dual-attached servers.  Eager to prove the concept out, and lacking the physical hardware to rapidly build and test the topology, I decided to implement the POC in a KVM environment, with Wistar acting as the topology manager.  Fundamentally, I was building what has become a "standard" layer 3 Clos data center fabric composed of Juniper virtual QFX switches attached to some Ubuntu servers.  Seemed simple enough, but clearly I hadn't thought through all the details...

Standing up the virtualized physical topology was simple enough -- two vQFX spine switches and three vQFX leaf switches.  Connect one server (Server #2) to Leaf #3 by a single Ethernet interface.  Connect the other server (Server #1) to Leaf #1 and Leaf #2 -- a single Ethernet interface from the server to each switch.  (Just for fun, I planted a Juniper virtual MX at the top as a DC edge router.  Not necessary for the POC, but still fun to play with.)  In all, the topology looks like this:
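
vMX      -- at the top as the DC edge (just for fun)
Spine 1  -- Leaf 1, Leaf 2, Leaf 3
Spine 2  -- Leaf 1, Leaf 2, Leaf 3
Leaf 1   -- Server 1 (one child link of the ae/bond)
Leaf 2   -- Server 1 (the other child link of the ae/bond)
Leaf 3   -- Server 2 (single link)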


I had previously written an Ansible playbook that automagically built the configs for each of the Spine and Leaf devices in the topology.  It wasn't perfect, as it only accounted for single-attached (rather than dual-attached) servers, but it got me to 98% configured in five minutes.  All that was left was to configure the aggregated Ethernet (ae) interface and its respective child links on Leaf #1 and #2.  Simple.
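
For reference, that leftover piece on Leaf #1 and #2 amounts to something along these lines.  The member interface number, ESI value, and LACP system-id below are illustrative placeholders rather than my literal lab config; the key point is that the ESI and LACP system-id must match on both leaves so the server sees a single LACP partner:

set chassis aggregated-devices ethernet device-count 1
set interfaces xe-0/0/2 ether-options 802.3ad ae0
set interfaces ae0 esi 00:01:01:01:01:01:01:01:01:01
set interfaces ae0 esi all-active
set interfaces ae0 aggregated-ether-options lacp active
set interfaces ae0 aggregated-ether-options lacp system-id 00:00:00:01:01:01

plus whatever unit 0 / VLAN configuration the rest of your fabric expects on ae0.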

I started the topology and quickly realized I had a problem.  I was running standard Juniper LAG on the vQFX leaf nodes -- basically, that means setting LACP active on the AE interface.  And I had set up bonding on Server #1 for mode 4 (802.3ad/LACP) with the transmit hash policy set to layer3+4 -- as close an analog to the Junos configuration as I could get on the server side.  But the AE link was down on the switch side...  Careful inspection revealed that no LACP frames were reaching the server.  The vQFX switches were sending them, but the server was not receiving them.  Nor was the server sending any LACP of its own.
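
For the record, the bonding configuration I was iterating on in /etc/network/interfaces looked roughly like this.  This is a sketch of one permutation rather than the one true config, and the second child link name and the address are placeholders:

auto ens4
iface ens4 inet manual
    bond-master bond0

auto ens5
iface ens5 inet manual
    bond-master bond0

auto bond0
iface bond0 inet static
    address 192.168.100.11
    netmask 255.255.255.0
    bond-mode 802.3ad
    bond-slaves ens4 ens5
    bond-miimon 100
    bond-xmit-hash-policy layer3+4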

It took a few seconds for the input to process...  The server was not connected to the vQFX leaf nodes via virtual wires; it was connected via Linux bridges!  And LACP frames are addressed to a reserved link-local multicast range (01:80:c2:00:00:0X) that 802.1D-compliant switches do not forward.  So the LACP exchange between the leaf nodes and the server was being swallowed by a couple of Linux bridges...
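
This is easy to see for yourself with tcpdump on the hypervisor.  LACP rides the "slow protocols" EtherType 0x8809, so watching the bridge ports for it shows the frames arriving from the vQFX side and never leaving toward the server.  (The vnetX tap names below are placeholders; brctl show will tell you which ports belong to which bridge.)

# on the vQFX-facing port of the bridge: LACPDUs from the switch
tcpdump -e -n -i vnet0 ether proto 0x8809
# on the server-facing port of the same bridge: silence
tcpdump -e -n -i vnet1 ether proto 0x8809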

The good news is that there's a fix for that!  Since the 3.2 kernel, Linux has included a little knob called the group forward mask that lets you tell a bridge to forward (rather than trap) certain link-local layer 2 protocol frames.  You write to the mask in this way:

echo maskValue > /sys/class/net/brXXX/bridge/group_fwd_mask

where maskValue is a bitmask (setting bit N tells the bridge to forward frames addressed to 01:80:c2:00:00:0N), and brXXX is the name of the bridge where you want to implement this change.  Simple, and the scope is reasonably limited.  So I wrote 255 to the mask, intending for my bridges-in-question to pass all of the potentially "interesting" traffic (Spanning Tree, LACP, LLDP, etc.) onwards.  Except it didn't work.  I tested with LLDP, and that worked beautifully.  But I still couldn't get my Linux bridges to forward LACP frames.  Hmm...
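
To make the bit positions concrete: LACP's destination address is 01:80:c2:00:00:02, so in principle forwarding just LACP needs only bit 2 (decimal 4) set.  (As the next paragraph explains, a stock kernel refuses exactly that value.)

# 01:80:c2:00:00:02 (LACP / slow protocols) -> bit 2 -> decimal 4
echo 4 > /sys/class/net/brXXX/bridge/group_fwd_mask
# read it back to confirm what the bridge will now forward
cat /sys/class/net/brXXX/bridge/group_fwd_mask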

So a little more reading turned up something else.  The folks who implemented the group_fwd_mask change were afraid of folks like me, and they were concerned that allowing too many protocols (like Spanning Tree and LACP) through could be disastrous -- and they're absolutely correct.  So they implemented another feature in conjunction with the group_fwd_mask -- a #define named BR_GROUPFWD_RESTRICTED that is set to 0x7u, which prevents you from setting the lowest three bits in the forwarding mask.  So you can't change the bridge behavior to permit Spanning Tree (01:80:c2:00:00:00), 802.3x pause frames (01:80:c2:00:00:01), or LACP (01:80:c2:00:00:02).

Now it's on.  That define is contained in the Linux source tree in the net/bridge/br_private.h header file.  So, add the Linux source package to the hypervisor platform, recompile with BR_GROUPFWD_RESTRICTED set to 0x0u, and reboot onto the new kernel.  Then "echo 255 > /sys/class/net/t5_br9/bridge/group_fwd_mask ; echo 255 > /sys/class/net/t5_br10/bridge/group_fwd_mask" and go.  Take that, people!
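
For the curious, the rebuild on my Ubuntu hypervisor went roughly like this (a sketch, not a kernel-building tutorial; it assumes deb-src repositories and the usual kernel build dependencies are already in place):

apt-get source linux-image-$(uname -r)
cd linux-*
# edit net/bridge/br_private.h: change BR_GROUPFWD_RESTRICTED from 0x7u to 0x0u
cp /boot/config-$(uname -r) .config
make olddefconfig
make -j$(nproc) deb-pkg
dpkg -i ../linux-image-*.deb
reboot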

That only half-fixed the issue.  At this point, I could see LACP frames from the vQFX leaf nodes reaching Server #1.  Looking at the output from "tcpdump -e -i ens4" (one of my child links on the server) showed perfectly formatted LACP frames from the vQFX arriving on the server child link.  But the output from "cat /proc/net/bonding/bond0" seemed to indicate that the server wasn't actually processing the frames.  The syslog output corroborated that as well, as no child links were joining the bond.  And on top of that, the server was not sending any LACP frames of its own.  And yet, LLDP was working great.  And if I forced the vQFX side up by committing "set interfaces ae0 aggregated-ether-options lacp force-up", then I could pass traffic between Server #1 and Server #2.  Weird...

I spent a couple more days occasionally Googling variations of "lacp" and "ubuntu" with "active mode" and "not receiving" to look for answers.  I tried every permutation of configuration in /etc/network/interfaces where I did or did not add explicit slave devices to the bond interface, and where I did or did not assign a bond master to the child links ... ifdown/ifup combinations ... reboots ... etc.  Then I found this statement, both odd and obvious, on The Geek Stuff blog:

If the Speed, duplex & Link status is unknown then the interface may be in down status. Try to bring up the interface using “ifconfig up”. If you still do not see the link then the interface is not connected to the switch.

I knew my server and switches were properly connected through the Linux bridges, and they were clearly working, as it seemed I could pass everything BUT stinkin' LACP over the links.  But what did I have to lose?  I checked the output from "ethtool ens4", and sure enough, ethtool reported a whole lot of nothing:

Settings for ens4:
Supported ports: [ ]
Supported link modes:   Not reported
Supported pause frame use: No
Supports auto-negotiation: No
Advertised link modes:  Not reported
Advertised pause frame use: No
Advertised auto-negotiation: No
Speed: Unknown!
Duplex: Unknown! (255)
Port: Other
PHYAD: 0
Transceiver: internal
Auto-negotiation: off
Link detected: yes

So it was back to checking the config for my server Ethernet ports.  It seems that when Wistar built the topology, the network interfaces were created as type virtio.  That clearly wasn't working, so what about good ol' e1000?  I shut the server down, changed the AE child links to type e1000, and rebooted.  Now the ethtool output looks like this:

Settings for ens4:
Supported ports: [ TP ]
Supported link modes:   10baseT/Half 10baseT/Full
                       100baseT/Half 100baseT/Full
                       1000baseT/Full
Supported pause frame use: No
Supports auto-negotiation: Yes
Advertised link modes:  10baseT/Half 10baseT/Full
                       100baseT/Half 100baseT/Full
                       1000baseT/Full
Advertised pause frame use: No
Advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 0
Transceiver: internal
Auto-negotiation: on
MDI-X: off (auto)
Cannot get wake-on-lan settings: Operation not permitted
Current message level: 0x00000007 (7)
      drv probe link
Link detected: yes

Well, that's a lot better!  And all of a sudden, the output from "cat /proc/net/bonding/bond0" shows healthy data.  And my vQFXs see LACP from the server.  Rolling back the vQFX configs to drop the "force-up" change, it all still works.  Flawless!  I've been running ping, SSH, SCP, etc. without issue ever since.  And I can even see the various sessions being load-balanced across the child links in the bonded interface!
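
One side note for anyone reproducing this without Wistar in the picture: if your guests are plain libvirt domains, the NIC model is set per interface in the domain XML (edit it with "virsh edit <domain>").  The snippet below is only a sketch; the bridge name is one of mine, and yours will differ:

<interface type='bridge'>
  <source bridge='t5_br9'/>
  <model type='e1000'/>      <!-- was type='virtio' -->
</interface>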

So there you have it.  While seemingly nothing else cares about the speed and duplex settings reported by ethtool, the 802.3ad bonding code does.  Without real speed and duplex data indicating that the links are up, Linux will not do any LACP processing.  And if you want to build your own EVPN multihoming POC on KVM, remember these important points:

  1. You will need to build your own kernel to tweak the BR_GROUPFWD_RESTRICTED define so that you can manipulate the lower three bits in the group_fwd_mask.
  2. Running that kernel, you will need to write a value to the group_fwd_mask of the correct Linux bridge that directs it to forward LACP.  Do this with great care, as there is a reason why 802.1D bridges do not forward this traffic by default.  Best to ensure that the bridge in question only has two connected interfaces -- the ones at each end of your virtual wire.
  3. You will also need to be sure that the virtual network interface you use on your virtual Linux hosts properly reports interface state in ethtool.  In my case, virtio did not, but e1000 did.


Happy hacking!