Table of Contents
A Note About VirtualBox and Cumulus
Cumulus and Unnumbered Interfaces
Spines (Router 1 and Router 2)
Leaves (Switch 1 and Switch 2)
The Case Against: The Service Leaf
Optimizing North-South Traffic
Changes to /etc/network/interfaces
Appendix A – Spine 1 Configuration Files
Appendix B – Spine 2 Configuration Files
Appendix C – Leaf 1 Configuration Files
Introduction and Purpose
Introduction
Almost three years ago, I wrote this long and detailed post about building a VXLAN-enabled virtual lab on a KVM hypervisor, using Arista’s vEOS image. Prior to that, I also wrote up this post regarding the ideas around using VXLAN for spanning L2 across an L3 infrastructure. In the second post, I wrote about the configuration challenges of static, unicast VXLAN configuration:
Further, VXLAN’s primary benefit is also a potential drawback: you have to manually configure each tunnel end point to mesh it with the rest of them. Vendors such as Arista provide a way to do that automatically using their services. Or the network team can write automation to handle the addition and subtraction of VXLAN TEPs. Neither of these are more or less advantageous, but it’s something that must be kept in mind if the idea of manually meshing a lot (eg: 120?) switches together with their upstream routers seems daunting.
Thankfully, the geeks in the networking world figured out a way to do that which involves BGP: an EVPN control plane for VXLAN.
Purpose
The purpose of this document is to describe my learning process using Cumulus Networks’ OS to build a virtual seven-node network. The network will employ an EVPN control plane for VXLAN so that static unicast VXLAN configuration isn’t necessary.
EVPN Control Plane
I’m not going to spend a bunch of time discussing the ins and outs of using an EVPN control plane to handle your VXLAN deployments; there are several good resources out there to learn about that from.
The idea is to use an extension to BGP to, essentially, announce L2 MAC addresses via BGP. I’m vastly over-simplifying the definition, but that’s basically what it is. As each leaf node forms an L2VPN EVPN peer with the upstream spines, it’ll begin participating in the control plane, form VTEPs with the other leaves, and switch Ethernet frames between them using the spines. No manual unicast VTEPs need to be created.
Doing This With Cumulus
I spent a considerable amount of time learning about this concept in my previous role at Verisign. At that point in time, Cisco’s Nexus 7K and 9K platforms had a fairly mature and easy-to-configure EVPN control plane for VXLAN. Since then, however, I haven’t had much of an opportunity to deploy or test it on other vendors’ switches and routers. But lately, I’ve turned my attention towards Cumulus Networks’ Linux OS for white box switches. Thankfully, Cumulus has virtual versions of its OS available for most hypervisors out there, including VMware, KVM, and VirtualBox. The rest of this document will focus on Cumulus’ configuration in a virtual environment.
VirtualBox Environment
You’ll recall from my previous entries that I had a Linux KVM server running specifically for these sorts of labs and education. I attempted to use the same box three years later, but it appears the power supply (PS) in the server is dead. The box will not power on, and after connecting the PS to a PS tester, I saw that it was not delivering +12V. Which is a bit of a problem. Power supplies aren’t expensive, but I just don’t have an extra one handy. So instead of waiting for one to ship, I decided to attempt this on VirtualBox on my Mac laptop.
VM Connections
In order for this to work properly, the hypervisor in question has to allow VMs to be connected, virtually, together. Fortunately, VirtualBox offers that flexibility when you configure each VM’s network adapters. As an example, I have one of my spines here, router1. And you can see that Network Adapter 2 is attached to an “Internal Network” that I’ve named “intnet-r1-sw1”.
Now when I create the VM for switch1, one of its interfaces will be on that exact, same internal network. And when the VMs are both running, they’ll be virtually directly connected together.
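If you prefer the command line over the GUI, the same wiring can be done with VBoxManage. This is just a sketch: the VM names are whatever you chose when creating the VMs, and I’m using the internal network name from the example above.

# attach router1's second adapter to the internal network shared with switch1
VBoxManage modifyvm "router1" --nic2 intnet --intnet2 "intnet-r1-sw1"
# give switch1 an adapter on the same internal network
VBoxManage modifyvm "switch1" --nic2 intnet --intnet2 "intnet-r1-sw1"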
A Note About VirtualBox and Cumulus
The Cumulus virtual OS comes ready to support 8 switch ports, named swp1, swp2, … swp8. It also has an eth0 interface, which corresponds to a real switch’s management Ethernet port. However, VirtualBox only allows for 4 Network Adapters per VM. And for some reason I’ve yet to determine, it associates Adapter 1 with Cumulus’ swp4. Meaning that each VM can only actually use swp4, swp5, swp6, and swp7. Interfaces eth0, swp1–swp3, and swp8 seem to be unavailable via VirtualBox.
root@router1:/etc/frr# ifconfig swp5
swp5 Link encap:Ethernet HWaddr 08:00:27:14:07:fd
inet6 addr: fe80::a00:27ff:fe14:7fd/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9216 Metric:1
RX packets:44239 errors:0 dropped:0 overruns:0 frame:0
TX packets:36094 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:4416338 (4.2 MiB) TX bytes:3627783 (3.4 MiB)
You can see the output from an ifconfig swp5 above. Note the HWAddr, which corresponds to Network Adapter 2 in the previous image.
This isn’t a limitation that will cause me any issues, really. I don’t expect to need more than 4 connections to any given virtual network device. But the limitation and numbering will need to be kept in mind.
“Physical” Layout
The above image shows how the virtual “physical” layout of the switches and servers is configured and connected. I’ll get into some of the important bits of each layer in this section. The basic idea was to have a simple two-leaf/two-spine architecture with a server on each leaf, all topped off with an aggregate router. We’ll start from the top down.
Agg Router
The agg router’s job is pretty simple: aggregate the IP knowledge of the entire virtual environment and provide a single gateway into and out of it. The VM configuration for the agg router is a bit different than the rest of the VMs; specifically its first interface swp4. I needed an IP path to get into and out of the environment so that I could more easily configure the VMs, as well as test to make sure the architecture works. To that end, the agg router’s swp4 (remember that’s the VM’s first interface according to VirtualBox) is actually bridged with my laptop’s en0 wireless interface.
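For the curious, the bridged adapter can also be set up from the command line. A sketch, assuming the agg VM is simply named “agg” and the Mac’s wireless interface is en0 as mentioned above (the exact adapter name VirtualBox expects may differ slightly on macOS):

# bridge the agg router's first adapter (Cumulus swp4) to the Mac's en0
VBoxManage modifyvm "agg" --nic1 bridged --bridgeadapter1 en0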
This configuration allows my Mac, along with anything else on my private wireless network at home, to get to the agg router’s swp4 interface. The /etc/network/interfaces file on the agg router looks like this:
# The loopback network interface
auto lo
iface lo inet loopback
    address 10.100.0.10/32

auto swp4
iface swp4 inet dhcp

auto swp5
iface swp5
    address 10.0.3.0/31

auto swp6
iface swp6
    address 10.0.3.2/31
Since swp4 is trying to DHCP out for an IP and it’s bridged to my internal wireless network, it gets one. In fact if we look at the Cumulus CLI command net show interfaces, we see:
# net show interfa
State  Name  Spd  MTU    Mode          LLDP                  Summary
-----  ----  ---  -----  ------------  --------------------  --------------------------
UP     lo    N/A  65536  Loopback                            IP: 127.0.0.1/8
       lo                                                    IP: 10.100.0.10/32
       lo                                                    IP: ::1/128
UP     swp4  1G   1500   Interface/L3  livingroom-sw0 (gi4)  IP: 192.168.10.80/24(DHCP)
UP     swp5  1G   1500   Interface/L3  router1 (swp7)        IP: 10.0.3.0/31
UP     swp6  1G   1500   Interface/L3  router2 (swp7)        IP: 10.0.3.2/31
A few things to point out here:
- Cumulus interfaces all have LLDP running by default. You can see that in the “LLDP” column of the output. The agg router has 2 other Cumulus devices hanging off of it on swp5 and 6. But swp4 sees a different device: the Cisco switch in my living room (which is what my WAP is connected to).
- Interface swp4 did, in fact, pick up a DHCP IP of 192.168.10.80/24
- Each of the other configured interfaces has a static IP on it.
Cumulus and Unnumbered Interfaces
We’ll see later that Cumulus does support unnumbered interfaces. Which, if you’ll pardon the editorial momentarily: is pretty fricken cool. It allows you to just tell Cumulus, “Hey, go ahead and form a BGP peer on that interface” without pre-assigning an IPv4 address to it. It uses the IPv6 link-local address to do so, and once the BGP peer is formed, IPv4 and IPv6 prefixes will be exchanged if desired. All automatically.
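As a preview (the same syntax shows up in the spine and leaf configs below), a minimal sketch of an unnumbered eBGP peer in frr.conf boils down to this:

router bgp 65210
 ! peer with whatever is plugged into swp5, using its IPv6 link-local address
 neighbor swp5 interface remote-as external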
But in this case, I wanted to treat the agg router as something that isn’t a Cumulus. I’m imagining a big pair of agg routers in a data center or something like that; likely from another vendor. With that, I decided to statically configure the IP addresses on the interfaces to make it seem more generic.
IP Connectivity With Host Mac
With the aforementioned DHCP IP of 192.168.10.80, the agg router should be reachable from the host Mac. Let’s check:
deadshot# ping 192.168.10.80
PING 192.168.10.80 (192.168.10.80): 56 data bytes
64 bytes from 192.168.10.80: icmp_seq=0 ttl=64 time=0.274 ms
64 bytes from 192.168.10.80: icmp_seq=1 ttl=64 time=0.318 ms
64 bytes from 192.168.10.80: icmp_seq=2 ttl=64 time=0.222 ms
64 bytes from 192.168.10.80: icmp_seq=3 ttl=64 time=0.306 ms
^C
--- 192.168.10.80 ping statistics ---
4 packets transmitted, 4 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.222/0.280/0.318/0.037 ms
Success. We’ll need some host-based static routes on the Mac if we’re going to get into that network beyond the agg router’s swp4 interface. With that:
deadshot# route add -net 10.0.0.0/8 192.168.10.80
add net 10.0.0.0: gateway 192.168.10.80
deadshot# route add -net 172.16.0.0/12 192.168.10.80
add net 172.16.0.0: gateway 192.168.10.80
Can we get to the loopback 10.100.0.10?
deadshot# telnet 10.100.0.10 22
Trying 10.100.0.10...
Connected to 10.100.0.10.
Escape character is '^]'.
SSH-2.0-OpenSSH_6.7p1 Debian-5+deb8u4
Yep. Success. Being able to ssh into and out of the VMs will make configuration much easier. The VM console window on VirtualBox is a bit clumsy.
BGP Config
In order to get BGP running on Cumulus, you have to make sure that both zebra and bgpd are running. This is handled in the file /etc/frr/daemons. Make sure these two lines say “yes”:
zebra=yes
bgpd=yes
The rest of the routing config is handled in the file /etc/frr/frr.conf. Alternatively, you can use FRR’s vtysh command to start a more comfortable and familiar CLI for configuration. But I’m going to focus on making file changes and restarting daemons in this blog. The agg router’s frr.conf file looks like:
# default to using syslog. /etc/rsyslog.d/45-frr.conf places the log
# in /var/log/frr/frr.log
log syslog informational

router bgp 65210
 bgp router-id 10.100.0.10
 redistribute connected
 !
 neighbor spine peer-group
 neighbor spine remote-as external
 neighbor spine capability extended-nexthop
 neighbor spine default-originate
 !
 neighbor swp5 interface peer-group spine
 neighbor swp6 interface peer-group spine
Anyone familiar with Cisco, Arista, or other similar BGP configurations will feel pretty at-home with this. Except for the last two lines: they’re part of what makes the Cumulus configuration so simple and flexible. Just above those two lines, I set up a simple “spine” peer-group with various properties. Most notably, it’s an eBGP peer-group, and the agg will originate a default route to the spines.
The two neighbor statements just tell the Cumulus router to form a BGP peer with whatever’s connected to those interfaces, and put the new neighbors in the “spine” peer group. Ridiculously easy.
Once these files are edited, a simple service frr restart command from the Linux CLI gets the ball rolling.
Spines (Router 1 and Router 2)
The image above shows how Router 1 is connected up; Router 2 is identical. Port swp7 is the uplink port to the previously-described agg router. Remember that although in this lab the agg router is a Cumulus device, I’m treating it as if it weren’t. So there’s an IPv4 address on the interface connected to the agg. The other 3 interfaces are unnumbered. Here’s what the /etc/network/interfaces file looks like:
# The loopback network interface
auto lo
iface lo inet loopback
    address 10.100.0.1/32

# The primary network interface
#auto eth0
#iface eth0 inet dhcp

auto swp4
iface swp4
    mtu 9216

auto swp5
iface swp5
    mtu 9216

auto swp6
iface swp6
    mtu 9216

auto swp7
iface swp7
    address 10.0.3.1/31
Pretty simple stuff. I set IP addresses on the loopback interface as well as port swp7. The other three aren’t numbered; they just have a jumbo MTU set on them. I did that because VXLAN is an overlay: the encapsulation adds roughly 50 bytes of overhead, so frames from the servers may end up larger than 1500 bytes on the wire.
# net show inter
State  Name  Spd  MTU    Mode           LLDP            Summary
-----  ----  ---  -----  -------------  --------------  ------------------
UP     lo    N/A  65536  Loopback                       IP: 127.0.0.1/8
       lo                                               IP: 10.100.0.1/32
       lo                                               IP: ::1/128
UP     swp4  1G   9216   NotConfigured  router2 (swp4)
UP     swp5  1G   9216   NotConfigured  sw1 (swp4)
UP     swp6  1G   9216   NotConfigured  sw2 (swp4)
UP     swp7  1G   1500   Interface/L3   agg (swp5)      IP: 10.0.3.1/31
The LLDP output clearly shows that swp4 is connected to the other spine, swp5 and swp6 are connected to leaf 1 and leaf 2 (sw1 and sw2), and swp7 is the uplink to the agg router.
BGP Config
How about the BGP config on the spines? It’s a bit more involved than the agg, but not by much:
router bgp 65200
 bgp router-id 10.100.0.1
 bgp bestpath as-path multipath-relax
 !
 neighbor swp7 interface
 neighbor swp7 remote-as external
 !
 neighbor swp4 interface
 neighbor swp4 remote-as internal
 neighbor swp4 next-hop-self
 !
 neighbor leaf peer-group
 neighbor leaf remote-as external
 !
 neighbor swp5 interface peer-group leaf
 neighbor swp6 interface peer-group leaf
 !
 ! IPv4
 address-family ipv4 unicast
  redistribute connected
  neighbor leaf activate
 exit-address-family
The neighbor on swp7 is our agg router, and it’s an external (eBGP) peer. The neighbor on swp4 is the other spine, so the peering is an iBGP one. Then I’ve set up a simple peer group for the leaves; they’re just another eBGP peer. Assign them, activate them, and redistribute connected routes into BGP. In the output below, you’ll see that the leaves are already configured and routing via BGP. We’ll get into that in the next section. But for now:
# net show bgp ipv4 unicast sum
BGP router identifier 10.100.0.1, local AS number 65200 vrf-id 0
BGP table version 17
RIB entries 17, using 2584 bytes of memory
Peers 4, using 77 KiB of memory
Peer groups 1, using 64 bytes of memory

Neighbor       V    AS  MsgRcvd  MsgSent  TblVer  InQ  OutQ  Up/Down   State/PfxRcd
sw1(swp5)      4 65201    17519    17521       0    0     0  14:35:07             2
sw2(swp6)      4 65201    17518    17521       0    0     0  14:35:07             2
router2(swp4)  4 65200    17507    17507       0    0     0  14:35:07             8
agg(swp7)      4 65210    17495    17500       0    0     0  01:09:20             5
And let’s make sure we’re hearing a 0/0 from the agg:
# net show route 0.0.0.0

RIB entry for 0.0.0.0
=====================
Routing entry for 0.0.0.0/0
  Known via "bgp", distance 20, metric 0, best
  Last update 01:10:40 ago
  * 10.0.3.0, via swp7

FIB entry for 0.0.0.0
=====================
default via 10.0.3.0 dev swp7  proto bgp  metric 20
Yep. Can we get to routers 1 and 2 via IP, from the host Mac?
deadshot# telnet 10.100.0.1 22
Trying 10.100.0.1...
Connected to 10.100.0.1.
Escape character is '^]'.
SSH-2.0-OpenSSH_6.7p1 Debian-5+deb8u4
Success.
Leaves (Switch 1 and Switch 2)
You can see how the leaf is connected, based on the diagram above. Each is identical. The first two interfaces are unnumbered. The third is a bit different. Port swp6 will be what our server connects to. We’ll put the server on VLAN100 via an access port on the leaf. Some of the VLAN configuration will show up a bit later as we discuss EVPN and VXLAN. But for now, the /etc/network/interfaces file:
# The loopback network interface
auto lo
iface lo inet loopback
    # The primary network interface
    address 10.100.0.3/32

# The primary network interface
#auto eth0
#iface eth0 inet dhcp

auto swp4
iface swp4
    mtu 9216

auto swp5
iface swp5
    mtu 9216

auto swp6
iface swp6
    bridge-access 100

auto bridge
iface bridge
    bridge-ports swp6
    bridge-vids 100
    bridge-vlan-aware yes
Again, swp4 and swp5 are unnumbered. Port swp6 has been set up with bridge access to VLAN 100, and we need a bridge interface with the VLAN configuration shown. Missing is the SVI for VLAN 100. In other words: the server that comes up on VLAN 100 won’t have ingress/egress yet. We’ll get to that a bit later.
BGP Config
Let’s show the BGP configuration on the leaf, even though we know it’s working from the previous section:
router bgp 65201
 bgp router-id 10.100.0.3
 bgp bestpath as-path multipath-relax
 neighbor spine peer-group
 neighbor spine remote-as external
 neighbor swp4 interface peer-group spine
 neighbor swp5 interface peer-group spine
 !
 address-family ipv4 unicast
  redistribute connected
  neighbor spine allowas-in origin
 exit-address-family
!
Simple and easy, and very much like the upstream spines: interface peering via ports swp4 and swp5, unnumbered. One small change is that we need to allow our local ASN in. Each leaf is in ASN 65201; if any leaf announces something via IPv4 that another leaf needs to hear for some reason, we have to allow ASN 65201 back in. This will, again, become a bit more important in a moment.
# net show bgp ipv4 uni sum
BGP router identifier 10.100.0.3, local AS number 65201 vrf-id 0
BGP table version 19
RIB entries 17, using 2584 bytes of memory
Peers 2, using 39 KiB of memory
Peer groups 1, using 64 bytes of memory

Neighbor       V    AS  MsgRcvd  MsgSent  TblVer  InQ  OutQ  Up/Down   State/PfxRcd
router1(swp4)  4 65200    17855    17854       0    0     0  14:51:47            10
router2(swp5)  4 65200    17856    17857       0    0     0  14:51:51            10
# net show route 0.0.0.0

RIB entry for 0.0.0.0
=====================
Routing entry for 0.0.0.0/0
  Known via "bgp", distance 20, metric 0, best
  Last update 14:52:03 ago
  * fe80::a00:27ff:fe14:7fd, via swp4
  * fe80::a00:27ff:fefd:b2b0, via swp5

FIB entry for 0.0.0.0
=====================
default  proto bgp  metric 20
        nexthop via 169.254.0.1 dev swp4 weight 1 onlink
        nexthop via 169.254.0.1 dev swp5 weight 1 onlink
And here you can see what happens when you use unnumbered BGP. The show route 0.0.0.0 command returns both spines. But they’re IPv6 addresses. More specifically, they’re link-local IPv6 addresses. Ultimately, that’s how the unnumbered trick is working with Cumulus. The nexthop is an IPv4 address of 169.254.0.1, but the interfaces are clearly two different ones.
Pretty cool, huh?
And let’s do the same SSH test as before, from the host Mac:
deadshot# telnet 10.100.0.3 22
Trying 10.100.0.3...
Connected to 10.100.0.3.
Escape character is '^]'.
SSH-2.0-OpenSSH_6.7p1 Debian-5+deb8u4
Done.
EVPN And VXLAN
I’m going to jump into the EVPN and VXLAN configurations of the four network devices in this section. Before I do that, I’d like to address a missing layer in the network and why I didn’t build it. This is going to seem like a tangent, but it’ll make sense a bit later. Stay with me.
The Case Against: The Service Leaf
I hate to use the word “traditional” when describing the diagram above, because that word generally indicates something that’s been around for a while. The design in the previous image is something that a certain network company, we’ll call them Company A, developed a few years ago. They coined the phrase “Leaf/Spine” network architecture and the industry quickly adopted the phrase. I can assure you: they weren’t the first folks to actually come up with the architecture; I was building networks like that back in the mid-1990s. Long before Company A was actually a thing.
Regardless of that, one of the key points in the design, which differs from mine, is the “Service Leaf” layer. Some companies call it the service leaf, others the border leaf, etc. I call it: a waste of hardware and a compromise in the network design that shouldn’t be made. Let’s dive into that.
Company A might tell you that the Service Leaf exists to keep the configuration on the Spines as simple as possible. They don’t want the Spines actively participating in VXLAN; just routing BGP packets to and from each leaf. The service leaf’s job is to form VTEPs with all of the tenant leaves, and provide a routed path into and out of the entire network. It’s to the service leaves that an eBGP peer would come in, either from a data center aggregation router, or an external ISP, or whatever. The service leaves are the same make and model as the tenant leaves, and that means they can actually perform VXLAN routing.
That part is key. It’s because of that, and not to keep the “spines simple”, that the service leaf exists. Company A’s first attempt at a big, dense, chassis-based switch didn’t have VXLAN routing in hardware. They couldn’t do it. Their top-of-rack switches supported it, easily. But the chassis didn’t. They didn’t have the appropriate merchant silicon on the line cards in the chassis. So: no VXLAN routing at first.
“Keep the spine config simple?” Or is it “We can’t do it on the spine?” I know which one I’m putting my money on.
The problem with the design is: extra layers of hardware for a packet to flow through. I’m thinking of an entire system. An entire data center. I’m not thinking about the little corner of the data center that has VXLAN running. End to end. The overall reliability of a system decreases with every device a packet has to flow through. It is important, in my opinion, that the network architecture reflect this reliability issue as best it can. That means: wherever it makes sense and wherever you can: remove network hardware. Don’t add it!
This is why I’m adamantly against the service leaf layer. Today, spine switches are built from the same merchant silicon that’s in the leaf switches. They can easily handle VXLAN routing. So: do it there!
Now then…
Spine Changes
Now that we have the L3 underpinning, we can add the overlay and control plane on top of it. This is pretty easy on Cumulus. No changes need to be made to the /etc/network/interfaces file.
Changes to /etc/frr/frr.conf
!
! EVPN config
address-family l2vpn evpn
 neighbor leaf activate
 !
exit-address-family
A simple l2vpn evpn address family needs to be added, and the same leaf peer-group that was previously defined needs to be activated. And that’s it. Just perform a service frr restart and the spines are ready to go.
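A quick sanity check after the restart doesn’t hurt. Something like the following (the same show command used later in this post) should start listing type-2 and type-3 EVPN routes once the leaves are configured:

service frr restart
net show bgp evpn route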
Leaf Changes
The changes at the switch layer are a bit more complicated because we need the VLAN 100 SVI configured, VLAN 100 added into VXLAN, and then the EVPN control plane configured.
Changes to /etc/network/interfaces
auto bridge
iface bridge
    bridge-ports swp6 vni10100
    bridge-vids 100
    bridge-vlan-aware yes

auto vni10100
iface vni10100
    bridge-access 100
    bridge-arp-nd-suppress on
    bridge-learning off
    mstpctl-bpduguard yes
    mstpctl-portbpdufilter yes
    mtu 9000
    vxlan-id 10100
    vxlan-local-tunnelip 10.100.0.3

auto vlan100
iface vlan100
    address 172.16.0.2/24
    vlan-id 100
    vlan-raw-device bridge
    address-virtual 44:39:39:FF:40:94 172.16.0.1/24
Here I’ve added two new interfaces: vni10100 and vlan100. The latter is the SVI for VLAN 100. It’s what will provide the routed path into and out of VLAN 100. Pretty simple. But let’s look at it a bit more closely: we see it has an address of .2, but a virtual address of .1 along with a MAC address set. That same virtual address and MAC need to be set on the VLAN 100 interface on switch 2. Like previous VXLAN configurations, this is how servers can be moved around to different switches in VLAN 100 and retain their same IP and default route. Each switch participating in VXLAN routing for VLAN 100 will need that virtual address line.
The vni10100 interface is what gets VXLAN going. The important bits are that it’s part of VLAN 100, has a VXLAN ID of 10100, and is using the switch’s loopback IP as its tunnel end point.
And finally to complete it, we add the new vni10100 interface to the bridge interface.
The command ifreload -a kicks all of the new interface information into gear.
root@sw1:/home/jvp# net show int
State  Name        Spd  MTU    Mode           LLDP            Summary
-----  ----------  ---  -----  -------------  --------------  ------------------
UP     lo          N/A  65536  Loopback                       IP: 127.0.0.1/8
       lo                                                     IP: 10.100.0.3/32
       lo                                                     IP: ::1/128
UP     swp4        1G   9216   NotConfigured  router1 (swp5)
UP     swp5        1G   9216   NotConfigured  router2 (swp5)
UP     swp6        1G   1500   Access/L2      sv1 (swp4)      Master: bridge(UP)
UP     bridge      N/A  1500   Bridge/L2
UP     vlan100     N/A  1500   Interface/L3                   IP: 172.16.0.2/24
UP     vlan100-v0  N/A  1500   Interface/L3                   IP: 172.16.0.1/24
UP     vni10100    N/A  9000   Access/L2                      Master: bridge(UP)
Of course, we have our server connected to swp6, which we haven’t been able to get to yet. Can we?
root@sw1:/home/jvp# ping 172.16.0.10
PING 172.16.0.10 (172.16.0.10) 56(84) bytes of data.
64 bytes from 172.16.0.10: icmp_seq=1 ttl=64 time=0.000 ms
64 bytes from 172.16.0.10: icmp_seq=2 ttl=64 time=0.362 ms
64 bytes from 172.16.0.10: icmp_seq=3 ttl=64 time=0.291 ms
64 bytes from 172.16.0.10: icmp_seq=4 ttl=64 time=0.295 ms
^C
--- 172.16.0.10 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3002ms
rtt min/avg/max/mdev = 0.000/0.237/0.362/0.139 ms
Yep.
We perform the same changes to the interfaces file on switch 2, and we can ping the server locally attached there. The only difference is the real IP we assign to VLAN 100: make that 172.16.0.3/24.
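For reference, sw2’s vlan100 stanza ends up like this; it’s identical to sw1’s other than the real address (the full file is in Appendix D):

auto vlan100
iface vlan100
    address 172.16.0.3/24
    vlan-id 100
    vlan-raw-device bridge
    address-virtual 44:39:39:FF:40:94 172.16.0.1/24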
root@sw2:/home/jvp# net show int
State  Name        Spd  MTU    Mode           LLDP            Summary
-----  ----------  ---  -----  -------------  --------------  ------------------
UP     lo          N/A  65536  Loopback                       IP: 127.0.0.1/8
       lo                                                     IP: 10.100.0.4/32
       lo                                                     IP: ::1/128
UP     swp4        1G   9216   NotConfigured  router1 (swp6)
UP     swp5        1G   9216   NotConfigured  router2 (swp6)
UP     swp6        1G   1500   Access/L2      sv2 (swp4)      Master: bridge(UP)
UP     bridge      N/A  1500   Bridge/L2
UP     vlan100     N/A  1500   Interface/L3                   IP: 172.16.0.3/24
UP     vlan100-v0  N/A  1500   Interface/L3                   IP: 172.16.0.1/24
UP     vni10100    N/A  9000   Access/L2                      Master: bridge(UP)

root@sw2:/home/jvp# ping 172.16.0.11
PING 172.16.0.11 (172.16.0.11) 56(84) bytes of data.
64 bytes from 172.16.0.11: icmp_seq=1 ttl=64 time=0.022 ms
64 bytes from 172.16.0.11: icmp_seq=2 ttl=64 time=0.392 ms
64 bytes from 172.16.0.11: icmp_seq=3 ttl=64 time=0.374 ms
--- 172.16.0.11 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 0.022/0.262/0.392/0.171 ms
Changes to /etc/frr/frr.conf
The changes to the frr.conf file are pretty simple, or so I thought…
address-family l2vpn evpn
 neighbor spine activate
 advertise-all-vni
exit-address-family
Activate the spine peer-group, and make sure to advertise any and all VNIs. In my case, VNI10100 would be the only one. But I wasn’t seeing any EVPN announcements. The only thing switch 1 saw was itself. Hm.
root@sw1:/home/jvp# net show bgp evpn route vni 10100
BGP table version is 20, local router ID is 10.100.0.3
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-2 prefix: [2]:[ESI]:[EthTag]:[MAClen]:[MAC]:[IPlen]:[IP]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
EVPN type-5 prefix: [5]:[ESI]:[EthTag]:[IPlen]:[IP]

   Network          Next Hop            Metric LocPrf Weight Path
*> [2]:[0]:[0]:[48]:[08:00:27:36:62:99]
                    10.100.0.3                         32768 i
*> [2]:[0]:[0]:[48]:[08:00:27:36:62:99]:[32]:[172.16.0.10]
                    10.100.0.3                         32768 i
*> [2]:[0]:[0]:[48]:[08:00:27:36:62:99]:[128]:[fe80::a00:27ff:fe36:6299]
                    10.100.0.3                         32768 i
*> [3]:[0]:[32]:[10.100.0.3]
                    10.100.0.3                         32768 i

Displayed 4 prefixes (4 paths)
The switch isn’t seeing the MAC address announcement from switch 2. What if I check one of the spines?
root@router1:/etc/network# net show bgp evpn route
BGP table version is 12, local router ID is 10.100.0.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-2 prefix: [2]:[ESI]:[EthTag]:[MAClen]:[MAC]:[IPlen]:[IP]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
EVPN type-5 prefix: [5]:[ESI]:[EthTag]:[IPlen]:[IP]

   Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 10.100.0.3:2
*> [2]:[0]:[0]:[48]:[08:00:27:36:62:99]
                    10.100.0.3                        0 65201 i
*> [2]:[0]:[0]:[48]:[08:00:27:36:62:99]:[32]:[172.16.0.10]
                    10.100.0.3                        0 65201 i
*> [2]:[0]:[0]:[48]:[08:00:27:36:62:99]:[128]:[fe80::a00:27ff:fe36:6299]
                    10.100.0.3                        0 65201 i
*> [3]:[0]:[32]:[10.100.0.3]
                    10.100.0.3                        0 65201 i
Route Distinguisher: 10.100.0.4:2
*> [2]:[0]:[0]:[48]:[08:00:27:36:f2:4e]
                    10.100.0.4                        0 65201 i
*> [2]:[0]:[0]:[48]:[08:00:27:36:f2:4e]:[32]:[172.16.0.11]
                    10.100.0.4                        0 65201 i
*> [2]:[0]:[0]:[48]:[08:00:27:36:f2:4e]:[128]:[fe80::a00:27ff:fe36:f24e]
                    10.100.0.4                        0 65201 i
*> [3]:[0]:[32]:[10.100.0.4]
                    10.100.0.4                        0 65201 i

Displayed 8 prefixes (8 paths)
The spine sees announcements from both leaves. But switch 1 isn’t hearing… or wait… maybe it’s not accepting the announcements from switch 2!? It turns out that’s exactly what the problem was. It was the output from the spine that gave me the hint I needed: the ASN of 65201 attached to the announcements. The switches were set up to allow their own ASN in for the IPv4 family, but not for the l2vpn evpn family!
Derp.
address-family l2vpn evpn
 neighbor spine activate
 neighbor spine allowas-in origin
 advertise-all-vni
exit-address-family
And now:
root@sw1:/home/jvp# net show bgp evpn route vni 10100
BGP table version is 20, local router ID is 10.100.0.3
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-2 prefix: [2]:[ESI]:[EthTag]:[MAClen]:[MAC]:[IPlen]:[IP]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
EVPN type-5 prefix: [5]:[ESI]:[EthTag]:[IPlen]:[IP]

   Network          Next Hop            Metric LocPrf Weight Path
*> [2]:[0]:[0]:[48]:[08:00:27:36:62:99]
                    10.100.0.3                         32768 i
*> [2]:[0]:[0]:[48]:[08:00:27:36:62:99]:[32]:[172.16.0.10]
                    10.100.0.3                         32768 i
*> [2]:[0]:[0]:[48]:[08:00:27:36:62:99]:[128]:[fe80::a00:27ff:fe36:6299]
                    10.100.0.3                         32768 i
*  [2]:[0]:[0]:[48]:[08:00:27:36:f2:4e]
                    10.100.0.4                        0 65200 65201 i
*> [2]:[0]:[0]:[48]:[08:00:27:36:f2:4e]
                    10.100.0.4                        0 65200 65201 i
*  [2]:[0]:[0]:[48]:[08:00:27:36:f2:4e]:[32]:[172.16.0.11]
                    10.100.0.4                        0 65200 65201 i
*> [2]:[0]:[0]:[48]:[08:00:27:36:f2:4e]:[32]:[172.16.0.11]
                    10.100.0.4                        0 65200 65201 i
*  [2]:[0]:[0]:[48]:[08:00:27:36:f2:4e]:[128]:[fe80::a00:27ff:fe36:f24e]
                    10.100.0.4                        0 65200 65201 i
*> [2]:[0]:[0]:[48]:[08:00:27:36:f2:4e]:[128]:[fe80::a00:27ff:fe36:f24e]
                    10.100.0.4                        0 65200 65201 i
*> [3]:[0]:[32]:[10.100.0.3]
                    10.100.0.3                         32768 i
*  [3]:[0]:[32]:[10.100.0.4]
                    10.100.0.4                        0 65200 65201 i
*> [3]:[0]:[32]:[10.100.0.4]
                    10.100.0.4                        0 65200 65201 i

Displayed 8 prefixes (12 paths)
The switches each see the servers on the other. Remember that 172.16.0.11 is on switch 2:
root@sw1:/home/jvp# ping 172.16.0.11
PING 172.16.0.11 (172.16.0.11) 56(84) bytes of data.
64 bytes from 172.16.0.11: icmp_seq=1 ttl=64 time=0.884 ms
64 bytes from 172.16.0.11: icmp_seq=2 ttl=64 time=1.06 ms
64 bytes from 172.16.0.11: icmp_seq=3 ttl=64 time=1.03 ms
64 bytes from 172.16.0.11: icmp_seq=4 ttl=64 time=1.03 ms
^C
--- 172.16.0.11 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3008ms
rtt min/avg/max/mdev = 0.884/1.004/1.065/0.080 ms

root@sw1:/home/jvp# net show route 172.16.0.11

RIB entry for 172.16.0.11
=========================
Routing entry for 172.16.0.0/24
  Known via "connected", distance 0, metric 0
  Last update 16:47:43 ago
  * directly connected, vlan100-v0

Routing entry for 172.16.0.0/24
  Known via "connected", distance 0, metric 0, best
  Last update 16:47:43 ago
  * directly connected, vlan100

FIB entry for 172.16.0.11
=========================
172.16.0.0/24 dev vlan100  proto kernel  scope link  src 172.16.0.2
172.16.0.0/24 dev vlan100-v0  proto kernel  scope link  src 172.16.0.1  metric 1024

root@sw1:/home/jvp# arp -an 172.16.0.11
arp: in 4 entries no match found.

root@sw1:/home/jvp# net show evpn arp-cache vni 10100
Number of ARPs (local and remote) known for this VNI: 8
IP                         Type    MAC                Remote VTEP
fe80::a00:27ff:fe36:f24e   remote  08:00:27:36:f2:4e  10.100.0.4
172.16.0.1                 local   44:39:39:ff:40:94
172.16.0.11                remote  08:00:27:36:f2:4e  10.100.0.4
172.16.0.10                local   08:00:27:36:62:99
fe80::4639:39ff:feff:4094  local   44:39:39:ff:40:94
fe80::a00:27ff:fe36:6299   local   08:00:27:36:62:99
fe80::a00:27ff:fec1:571a   local   08:00:27:c1:57:1a
172.16.0.2                 local   08:00:27:c1:57:1a
Checking from the Mac:
deadshot# ping 172.16.0.10
PING 172.16.0.10 (172.16.0.10): 56 data bytes
64 bytes from 172.16.0.10: icmp_seq=0 ttl=61 time=1.253 ms
64 bytes from 172.16.0.10: icmp_seq=1 ttl=61 time=1.029 ms
64 bytes from 172.16.0.10: icmp_seq=2 ttl=61 time=1.260 ms
^C
--- 172.16.0.10 ping statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 1.029/1.181/1.260/0.107 ms

deadshot# ping 172.16.0.11
PING 172.16.0.11 (172.16.0.11): 56 data bytes
64 bytes from 172.16.0.11: icmp_seq=0 ttl=61 time=1.091 ms
64 bytes from 172.16.0.11: icmp_seq=1 ttl=61 time=1.100 ms
64 bytes from 172.16.0.11: icmp_seq=2 ttl=61 time=1.161 ms
^C
--- 172.16.0.11 ping statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 1.091/1.117/1.161/0.031 ms

deadshot# telnet 172.16.0.11 22
Trying 172.16.0.11...
Connected to 172.16.0.11.
Escape character is '^]'.
SSH-2.0-OpenSSH_6.7p1 Debian-5+deb8u4
I can ping both of them and get to their SSH ports. Cool.
One last check is from the servers themselves. If they’re on the same L2 broadcast domain, they’ll have ARP entries for each other. Let’s check one of the servers:
root@sv1:/home/jvp# ping 172.16.0.11
PING 172.16.0.11 (172.16.0.11) 56(84) bytes of data.
64 bytes from 172.16.0.11: icmp_seq=1 ttl=64 time=1.94 ms
64 bytes from 172.16.0.11: icmp_seq=2 ttl=64 time=1.35 ms
64 bytes from 172.16.0.11: icmp_seq=3 ttl=64 time=0.952 ms
^C
--- 172.16.0.11 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 0.952/1.416/1.940/0.407 ms

root@sv1:/home/jvp# arp -an
? (172.16.0.4) at 42:cd:17:09:60:b4 [ether] on swp4
? (172.16.0.2) at 08:00:27:c1:57:1a [ether] on swp4
? (172.16.0.1) at 44:39:39:ff:40:94 [ether] on swp4
? (172.16.0.11) at 08:00:27:36:f2:4e [ether] on swp4
Success.
Optimizing North-South Traffic
If I wanted to follow the “traditional” VXLAN leaf/spine architecture, I’d call myself done with the previous exercise. But: I’m not done because I don’t want to follow that architecture. The problem with the network that I have set up right now is inefficient routing to each server from off-net. I’ll explain. Ignoring the aggregation router here: the entry point into the entire network is through the spines. They know about the 172.16.0.0/24 prefix via both switches. Meaning they’ll ECMP incoming traffic that’s destined for them.
root@router1:/etc/network# net show route 172.16.0.0/24

RIB entry for 172.16.0.0/24
===========================
Routing entry for 172.16.0.0/24
  Known via "bgp", distance 20, metric 0, best
  Last update 01:10:35 ago
  * fe80::a00:27ff:fec7:fe0c, via swp5
  * fe80::a00:27ff:fed5:26e6, via swp6

FIB entry for 172.16.0.0/24
===========================
172.16.0.0/24  proto bgp  metric 20
        nexthop via 169.254.0.1 dev swp5 weight 1 onlink
        nexthop via 169.254.0.1 dev swp6 weight 1 onlink
There’s a 50% chance that the packet will come in for 172.16.0.10 and be sent to switch 1. Meaning there’s a 50% chance it’ll be sent to switch 2. If the former: that’s good. If the latter, it means switch 2 will encapsulate the packet into a VXLAN tunnel, send it back to the spines destined for switch 1. Switch 1 will decapsulate it and deliver it to the server. This works fine, but is somewhat inefficient.
And the more leaves you have, the greater the chance of a packet getting routed inefficiently: with N leaves advertising the prefix, only about 1 in N packets will land directly on the leaf that actually hosts the destination server.
This is one reason the service leaf continues to exist today: because it has VXLAN tunnels with each of the downstream leaves, it knows exactly which leaf the server is on, it encaps the packet into a VXLAN tunnel, and sends it directly to the right leaf.
VXLAN On The Spine
Doing VXLAN on the spine is the other way to solve this problem, and it makes for a more efficient and reliable network architecture. It means we need to add the VLANs, bridge interfaces, and VNIs to the spines. So that has to be kept in mind: this isn’t a freebie. More configuration is necessary. But the trade-off is fewer network devices to manage. I think that’s a win, personally.
Changes to /etc/network/interfaces
The changes are just like what I added to the switches earlier:
auto bridge
iface bridge
    bridge-ports vni10100
    bridge-vids 100
    bridge-vlan-aware yes

auto vlan100
iface vlan100
    address 172.16.0.4/24
    vlan-id 100
    vlan-raw-device bridge

iface vni10100
    bridge-access 100
    bridge-arp-nd-suppress on
    bridge-learning off
    vxlan-id 10100
    vxlan-local-tunnelip 10.100.0.1
    mtu 9000
Reloading the interface config gets a whole new spine:
root@router1:/etc/network# net show int
State  Name      Spd  MTU    Mode           LLDP            Summary
-----  --------  ---  -----  -------------  --------------  ------------------
UP     lo        N/A  65536  Loopback                       IP: 127.0.0.1/8
       lo                                                   IP: 10.100.0.1/32
       lo                                                   IP: ::1/128
UP     swp4      1G   9216   NotConfigured  router2 (swp4)
UP     swp5      1G   9216   NotConfigured  sw1 (swp4)
UP     swp6      1G   9216   NotConfigured  sw2 (swp4)
UP     swp7      1G   1500   Interface/L3   agg (swp5)      IP: 10.0.3.1/31
UP     bridge    N/A  9000   Bridge/L2
UP     vlan100   N/A  9000   Interface/L3                   IP: 172.16.0.4/24
UP     vni10100  N/A  9000   Access/L2                      Master: bridge(UP)
And I can still ping the server in question. But the difference is: I have ARP knowledge of it now, via VXLAN!
root@router1:/etc/network# ping 172.16.0.10
PING 172.16.0.10 (172.16.0.10) 56(84) bytes of data.
64 bytes from 172.16.0.10: icmp_seq=1 ttl=64 time=0.754 ms
64 bytes from 172.16.0.10: icmp_seq=2 ttl=64 time=1.12 ms
^C
--- 172.16.0.10 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.754/0.939/1.125/0.188 ms

root@router1:/etc/network# net show route 172.16.0.10

RIB entry for 172.16.0.10
=========================
Routing entry for 172.16.0.0/24
  Known via "connected", distance 0, metric 0, best
  Last update 00:01:13 ago
  * directly connected, vlan100

FIB entry for 172.16.0.10
=========================
172.16.0.0/24 dev vlan100  proto kernel  scope link  src 172.16.0.4

root@router1:/etc/network# net show evpn arp-cache vni 10100
Number of ARPs (local and remote) known for this VNI: 6
IP                         Type    MAC                Remote VTEP
fe80::a00:27ff:fe36:f24e   remote  08:00:27:36:f2:4e  10.100.0.4
fe80::40cd:17ff:fe09:60b4  local   42:cd:17:09:60:b4
172.16.0.4                 local   42:cd:17:09:60:b4
172.16.0.11                remote  08:00:27:36:f2:4e  10.100.0.4
172.16.0.10                remote  08:00:27:36:62:99  10.100.0.3
fe80::a00:27ff:fe36:6299   remote  08:00:27:36:62:99  10.100.0.3
What this means is that the packet will come into either of the spines, they’ll immediately encapsulate the packet, and send it to the appropriate VTEP. In this instance, that means switch 1.
What Else Can We Do?
If we’re very careful with L2 applications on the spines, we can do a few more things without the need for those service leaves. One “for instance” is a load balancer that may need L2 access to a bunch of the servers on various leaves. If we assume our load balancer is too stupid and incapable of participating in VXLAN and it needs L2 access, that means you have to trunk L2 to it. And that same L2 needs to be available to the servers on the leaves. So attach it to the spines on L2 trunk ports, and add those into the same VNI as the server VLAN(s). If your company has decided it needs HA load balancers (which are rarely a good idea), that means you need a contiguous L2 broadcast domain between them for their virtual default route. Again: this is easy. Connect LB1 to spine 1, LB2 to spine 2. Set up a trunk interface on each spine with the same VLAN, and then use VXLAN between the two spines to carry that L2 information. The LBs will see each other via that path, and determine which is active and which is standby. And because that virtual IP is on the same VLAN as the downstream servers, the servers will be able to use it.
HA firewalls are the same way. Again: these abominations of devices need a contiguous L2 broadcast domain so that they can figure out which firewall is active and which is standby. And again: this can easily be accomplished on the spines with careful trunking and VXLAN’ing.
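To make that concrete, here’s a rough sketch of what such a trunk toward an LB or firewall pair could look like on a spine. The port name swp8 is hypothetical (and not usable in this particular VirtualBox lab, per the adapter limitation noted earlier); the bridge and VNI pieces are the ones already configured on the spines above.

# hypothetical trunk port toward a load balancer or firewall
auto swp8
iface swp8
    bridge-vids 100

auto bridge
iface bridge
    bridge-ports swp8 vni10100
    bridge-vids 100
    bridge-vlan-aware yes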
Appendices
Appendix A – Spine 1 Configuration Files
/etc/network/interfaces
# The loopback network interface
auto lo
iface lo inet loopback
    address 10.100.0.1/32

auto swp4
iface swp4
    mtu 9216

auto swp5
iface swp5
    mtu 9216

auto swp6
iface swp6
    mtu 9216

auto swp7
iface swp7
    address 10.0.3.1/31

auto bridge
iface bridge
    bridge-ports vni10100
    bridge-vids 100
    bridge-vlan-aware yes

auto vlan100
iface vlan100
    address 172.16.0.4/24
    vlan-id 100
    vlan-raw-device bridge

iface vni10100
    bridge-access 100
    bridge-arp-nd-suppress on
    bridge-learning off
    vxlan-id 10100
    vxlan-local-tunnelip 10.100.0.1
    mtu 9000
/etc/frr/frr.conf
router bgp 65200
 bgp router-id 10.100.0.1
 bgp bestpath as-path multipath-relax
 redistribute connected
 !
 neighbor swp7 interface
 neighbor swp7 remote-as external
 neighbor swp7 capability extended-nexthop
 !
 neighbor swp4 interface
 neighbor swp4 remote-as internal
 neighbor swp4 next-hop-self
 !
 neighbor leaf peer-group
 neighbor leaf remote-as external
 !
 neighbor swp5 interface peer-group leaf
 neighbor swp6 interface peer-group leaf
 !
 ! IPv4
 address-family ipv4 unicast
  redistribute connected
  neighbor leaf activate
 exit-address-family
 !
 ! EVPN config
 address-family l2vpn evpn
  neighbor leaf activate
  neighbor swp4 activate
  advertise-all-vni
 !
 exit-address-family
Appendix B – Spine 2 Configuration Files
/etc/network/interfaces
# The loopback network interface
auto lo
iface lo inet loopback
    # The primary network interface
    address 10.100.0.2/32

auto swp4
iface swp4
    mtu 9216

auto swp5
iface swp5
    mtu 9216

auto swp6
iface swp6
    mtu 9216

auto swp7
iface swp7
    address 10.0.3.3/31

auto bridge
iface bridge
    bridge-ports vni10100
    bridge-vids 100
    bridge-vlan-aware yes

auto vlan100
iface vlan100
    address 172.16.0.5/24
    vlan-id 100
    vlan-raw-device bridge

iface vni10100
    bridge-access 100
    bridge-arp-nd-suppress on
    bridge-learning off
    vxlan-id 10100
    vxlan-local-tunnelip 10.100.0.2
    mtu 9000
/etc/frr/frr.conf
router bgp 65200
 bgp router-id 10.100.0.2
 bgp bestpath as-path multipath-relax
 !
 neighbor swp7 interface
 neighbor swp7 remote-as external
 neighbor swp7 capability extended-nexthop
 !
 neighbor swp4 interface
 neighbor swp4 remote-as internal
 neighbor swp4 next-hop-self
 !
 neighbor leaf peer-group
 neighbor leaf remote-as external
 !
 neighbor swp5 interface peer-group leaf
 neighbor swp6 interface peer-group leaf
 !
 ! IPv4
 address-family ipv4 unicast
  redistribute connected
  neighbor leaf activate
 exit-address-family
 !
 ! EVPN config
 address-family l2vpn evpn
  neighbor leaf activate
  neighbor swp4 activate
  advertise-all-vni
 !
 exit-address-family
Appendix C – Leaf 1 Configuration Files
/etc/network/interfaces
# The loopback network interface
auto lo
iface lo inet loopback
    # The primary network interface
    address 10.100.0.3/32

auto swp4
iface swp4
    mtu 9216

auto swp5
iface swp5
    mtu 9216

auto swp6
iface swp6
    bridge-access 100

auto bridge
iface bridge
    bridge-ports swp6 vni10100
    bridge-vids 100
    bridge-vlan-aware yes

auto vni10100
iface vni10100
    bridge-access 100
    bridge-arp-nd-suppress on
    bridge-learning off
    mstpctl-bpduguard yes
    mstpctl-portbpdufilter yes
    mtu 9000
    vxlan-id 10100
    vxlan-local-tunnelip 10.100.0.3

auto vlan100
iface vlan100
    address 172.16.0.2/24
    vlan-id 100
    vlan-raw-device bridge
    address-virtual 44:39:39:FF:40:94 172.16.0.1/24
/etc/frr/frr.conf
router bgp 65201
 bgp router-id 10.100.0.3
 bgp bestpath as-path multipath-relax
 !
 neighbor spine peer-group
 neighbor spine remote-as external
 !
 neighbor swp4 interface peer-group spine
 neighbor swp5 interface peer-group spine
 !
 ! IPv4
 address-family ipv4 unicast
  redistribute connected
  neighbor spine allowas-in origin
 exit-address-family
 !
 ! EVPN config
 address-family l2vpn evpn
  neighbor spine activate
  neighbor spine allowas-in origin
  advertise-all-vni
 exit-address-family
Appendix D – Leaf 2 Configuration Files
/etc/network/interfaces
# The loopback network interface
auto lo
iface lo inet loopback
    # The primary network interface
    address 10.100.0.4/32

# The primary network interface
#auto eth0
#iface eth0 inet dhcp

auto swp4
iface swp4
    mtu 9216

auto swp5
iface swp5
    mtu 9216

auto swp6
iface swp6
    bridge-access 100

auto bridge
iface bridge
    bridge-ports swp6 vni10100
    bridge-vids 100
    bridge-vlan-aware yes

auto vni10100
iface vni10100
    bridge-access 100
    bridge-arp-nd-suppress on
    bridge-learning off
    mstpctl-bpduguard yes
    mstpctl-portbpdufilter yes
    mtu 9000
    vxlan-id 10100
    vxlan-local-tunnelip 10.100.0.4

auto vlan100
iface vlan100
    address 172.16.0.3/24
    vlan-id 100
    vlan-raw-device bridge
    address-virtual 44:39:39:FF:40:94 172.16.0.1/24
/etc/frr/frr.conf
router bgp 65201
 bgp router-id 10.100.0.4
 bgp bestpath as-path multipath-relax
 !
 neighbor spine peer-group
 neighbor spine remote-as external
 !
 neighbor swp4 interface peer-group spine
 neighbor swp5 interface peer-group spine
 !
 ! IPv4
 address-family ipv4 unicast
  redistribute connected
  neighbor spine allowas-in origin
 exit-address-family
 !
 ! EVPN config
 address-family l2vpn evpn
  neighbor spine activate
  neighbor spine allowas-in origin
  advertise-all-vni
 exit-address-family