NSX and vCloud Director lab problems
I had to do some NSX troubleshooting on a lab environment running vCloud Director and NSX yesterday with a colleague. Apparently something broke in our lab during an NSX upgrade and the quick and dirty decision was made to just reinstall NSX. As it turns out, the hosts were not properly unprepared in the process. I don’t know much of the details of what went wrong or which troubleshooting steps had already been executed, but by the time I was asked to have a look, judging from the Web Client, everything seemed to be running fine NSX-wise:
- The hosts in the Compute Cluster were correctly configured and prepared;
- The VXLAN Tunnel Endpoints (VTEPs) were correctly configured;
- The NSX Manager and Controllers were running fine;
- and so on…
NSX Troubleshooting VXLAN
But … Virtual Machines running on different ESXi hosts connected to the same Logical Switch could not communicate with each other, while Virtual Machines on the same ESXi host could. VXLAN traffic was not being handled properly, so we started troubleshooting L2 connectivity. We used Roie Ben Haim‘s excellent NSX L2 troubleshooting guide on his blog http://www.routetocloud.com, which walks you through a series of structured, well laid out steps to find the root cause. Two bits of information stood out during this process:
- Checking the status of the ‘Control Plane’ showed ‘Disabled’ for all VNIs. Below is a sample screenshot of the output of the net-vdl2 command where the Control Plane is ‘Disabled’:
- There were no established connections on TCP port 1234 between the ESXi hosts and the NSX Controllers when running esxcli network ip connection list | grep 1234.
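To make that second check less error-prone than eyeballing a grep, you can filter the esxcli output programmatically. Below is a minimal sketch that parses the output of `esxcli network ip connection list` and returns only the ESTABLISHED sessions towards port 1234; the sample output and IP addresses are illustrative, not captured from our lab, and the column layout is assumed to match what esxcli prints.

```python
# Hypothetical sketch: pick out ESTABLISHED TCP sessions to the NSX
# Controllers (port 1234) from `esxcli network ip connection list` output.
# On a healthy host you expect one netcpa session per controller; here we
# parse a canned sample string instead of running esxcli.
def controller_connections(esxcli_output, port=1234):
    """Return (local, peer) tuples for ESTABLISHED TCP sessions to `port`."""
    sessions = []
    for line in esxcli_output.splitlines():
        fields = line.split()
        # Assumed columns: Proto RecvQ SendQ LocalAddress ForeignAddress State ...
        if len(fields) < 6 or fields[0] != "tcp":
            continue
        local, peer, state = fields[3], fields[4], fields[5]
        if peer.endswith(":%d" % port) and state == "ESTABLISHED":
            sessions.append((local, peer))
    return sessions

# Illustrative sample output (addresses and world IDs are made up):
sample = """\
tcp  0  0  192.168.110.51:31231  192.168.110.201:1234  ESTABLISHED  35185  newreno  netcpa-worker
tcp  0  0  192.168.110.51:26121  192.168.110.202:1234  ESTABLISHED  35185  newreno  netcpa-worker
tcp  0  0  192.168.110.51:48734  192.168.110.31:443    ESTABLISHED  34808  newreno  hostd-worker
"""
print(controller_connections(sample))
# → [('192.168.110.51:31231', '192.168.110.201:1234'),
#    ('192.168.110.51:26121', '192.168.110.202:1234')]
```

In our case the equivalent of this check came back empty, which is what sent us off chasing the controllers.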
With the NSX upgrade problems and the quick & dirty reinstallation of NSX-v in my mind, we embarked on some advanced command line troubleshooting on the NSX Controllers. We kind of jumped to the conclusion that the NSX reinstallation must have broken “something”. Remember, we had already found the NSX Control Plane was ‘Disabled’ for each and every VNI. Long story short, we could not find a root cause.
Tracing back our steps
At this point we decided to retrace our steps and start the NSX troubleshooting process over. That’s when my colleague mentioned he had to re-create the networking configuration in vCloud Director because of the NSX reinstallation, and it hit me: vCloud Director defaults to Multicast Replication Mode when Transport Zones are created. As it turns out, all Logical Switches were configured in Multicast Replication Mode, while the physical network was not configured for multicast. When we changed it to Unicast Mode, Virtual Machines could communicate and VXLAN traffic was flowing happily in our lab. Duh!
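Had we checked the Transport Zone configuration first, the misconfiguration would have been obvious. As a sketch of that check: the NSX-v REST API exposes Transport Zones (including their control plane mode) via GET /api/2.0/vdn/scopes, returning XML. The snippet below parses such a response and flags zones still in multicast mode; the sample XML, zone names, and exact response layout are assumptions for illustration, not captured from a real NSX Manager.

```python
# Hypothetical sketch: flag Transport Zones whose control plane mode is
# MULTICAST_MODE by parsing an NSX-v style XML response from
# GET /api/2.0/vdn/scopes. The sample response below is illustrative.
import xml.etree.ElementTree as ET

def multicast_transport_zones(xml_text):
    """Return names of transport zones with controlPlaneMode MULTICAST_MODE."""
    root = ET.fromstring(xml_text)
    flagged = []
    for scope in root.findall("vdnScope"):
        name = scope.findtext("name")
        mode = scope.findtext("controlPlaneMode")
        if mode == "MULTICAST_MODE":
            flagged.append(name)
    return flagged

# Illustrative response (zone names are made up):
sample_response = """\
<vdnScopes>
  <vdnScope>
    <name>TZ-vCD-ProviderVDC</name>
    <controlPlaneMode>MULTICAST_MODE</controlPlaneMode>
  </vdnScope>
  <vdnScope>
    <name>TZ-Management</name>
    <controlPlaneMode>UNICAST_MODE</controlPlaneMode>
  </vdnScope>
</vdnScopes>
"""
print(multicast_transport_zones(sample_response))
# → ['TZ-vCD-ProviderVDC']
```

A check like this makes the replication mode an explicit fact instead of something you assume is still set the way you left it.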
Lesson learned: don’t jump to conclusions when troubleshooting
The failed NSX upgrade and the quick and dirty reinstallation clouded my judgement. I assumed something was horribly wrong with the NSX Controllers because the Control Plane had a status of ‘Disabled’ for every VNI and there were no TCP connections being established between the ESXi hosts and the NSX Controllers. Well, the NSX Controllers are bypassed when Multicast Replication Mode is used, so that’s expected behaviour by design… It was just a simple configuration error.
I guess it was just one of those days…