Failover and High-Availability

Is it normal that I see bandwidth utilization on the slave unit’s web interface?

Yes, this is normal, and it is not limited to the web interface; the CLI behaves the same way.

The GMAC information, including speed and bandwidth usage, is replicated from the Master to the Slave (when possible, depending on the GMAC type). This prepares the slave for a possible future switch to master, so that those statistics are retained even through a master unit failure.

The switch blocks the Link LB ports.

In failover mode with multiple VFIs using the same switch, the spanning tree protocol detects more than one path through the Link LB because the Link LB is transparent at OSI layer 2. The solution is to deactivate spanning tree on the switch or use the feature l2_filtering enable. This feature filters any packet other than IP (0x0800) or ARP (0x0806).
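
To illustrate what filtering on those two EtherTypes means in practice, here is a minimal Python sketch. It is only a model of the rule described above, not the Link LB implementation: the function names, the placeholder MAC addresses, and the assumption that frames are untagged Ethernet II are all hypothetical.

```python
import struct

# EtherTypes named in the text above.
ETHERTYPE_IPV4 = 0x0800
ETHERTYPE_ARP = 0x0806
ALLOWED_ETHERTYPES = {ETHERTYPE_IPV4, ETHERTYPE_ARP}


def l2_filter_passes(frame: bytes) -> bool:
    """Return True if an untagged Ethernet frame would pass the filter.

    Hypothetical model only: the real l2_filtering behavior (for example
    with VLAN-tagged frames) is not described in this FAQ.
    """
    # Header layout: 6 bytes dst MAC + 6 bytes src MAC + 2 bytes type/length.
    if len(frame) < 14:
        return False
    (ethertype,) = struct.unpack("!H", frame[12:14])
    return ethertype in ALLOWED_ETHERTYPES


def make_frame(ethertype: int, payload: bytes = b"") -> bytes:
    """Build a minimal test frame with placeholder MAC addresses."""
    dst = bytes.fromhex("ffffffffffff")
    src = bytes.fromhex("0050c255200f")
    return dst + src + struct.pack("!H", ethertype) + payload


if __name__ == "__main__":
    print(l2_filter_passes(make_frame(0x0800)))  # IP frame
    print(l2_filter_passes(make_frame(0x0806)))  # ARP frame
    print(l2_filter_passes(make_frame(0x0026)))  # 802.3 length field, not an EtherType
```

A classic spanning tree BPDU travels in an 802.3/LLC frame whose type/length field holds a length rather than 0x0800 or 0x0806, so under this model it would not pass through the unit.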

The Link LB reports an error in the SYSTEM logs: I-VFI0-020001 2 Arp for mac address [00:50:C2:55:20:0F] received, it belong to my interface [eth1], possible LOOP detected

The Link LB, with its inline operation, can be viewed as a bridge or hub in passthrough mode. Internal verifications are performed to prevent inside/outside or outside/outside ports from being “short-circuited”.

Each time a cable is connected (or a configuration change is made), the switch tries to detect possible loops. Although the Link LB inside and outside ports are in separate VLANs, they detect each other at OSI layer 2 for a few seconds, which makes the service unstable. The solution is to deactivate spanning tree on the switch and use the feature l2_filtering enable. This feature filters any packet other than IP (0x0800) or ARP (0x0806).

This is also very important in failover mode, to prevent an unwanted role switch between the master and slave units.

In failover mode, which features are required for passthrough?

  • feature arp_reply_in enable.
  • feature arp_reply_out enable.
  • feature arp_learn_in enable.
  • feature arp_learn_out enable.

In EOS version 3.1.20, those four commands can be replaced by the single command feature arp group enable.

What is the difference between pause and resume?

NOTE: Be very careful with the pause and resume commands. As outlined below, they exist in several different modules and behave differently depending on the module you execute them in.

In the FOVE module on the master unit, the pause command stops forwarding packets for all VFIs and releases the master state. In the FOVE module on the slave unit, the pause command releases the slave state, so this unit will neither accept further updates from the master nor take the master state.

In the VFI module on the master unit, the pause command stops forwarding all packets for this specific VFI and does not release the master state. In the VFI module on the slave unit, the pause command has no effect because the VFI modules on the slave unit are already paused.

After a pause command has been issued, the resume command must be used in the FOVE module to resume operation.
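
The four pause cases above can be summarized as a small lookup table. The Python sketch below is only a memory aid built from the descriptions in this answer; the dictionary, the function names, and the lowercase "fove"/"vfi" and "master"/"slave" keys are illustrative and are not part of the EOS CLI.

```python
# (module, role) -> effect of the pause command, paraphrased from the text above.
PAUSE_EFFECTS = {
    ("fove", "master"): "stops forwarding for all VFIs and releases the master state",
    ("fove", "slave"): "releases the slave state; no further updates, will not take master",
    ("vfi", "master"): "stops forwarding for this VFI only; master state is kept",
    ("vfi", "slave"): "no effect; VFI modules on the slave are already paused",
}


def pause_effect(module: str, role: str) -> str:
    """Return the documented effect of pause for a module/role pair."""
    key = (module.lower(), role.lower())
    if key not in PAUSE_EFFECTS:
        raise ValueError(f"unknown module/role combination: {key}")
    return PAUSE_EFFECTS[key]


def releases_failover_state(module: str, role: str) -> bool:
    """True when pause gives up the unit's master or slave state.

    Per the text, only pause issued in the FOVE module does this.
    """
    pause_effect(module, role)  # validates the arguments
    return module.lower() == "fove"


if __name__ == "__main__":
    for (module, role), effect in PAUSE_EFFECTS.items():
        print(f"{module:>4} / {role:<6}: {effect}")
```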

What are the limitations of the LAN failsafe bypass function compared to having two Elfiq units in failover/high availability?

The Elfiq E Series models have a new hardware capability which could further improve the resilience of your network. The LAN failsafe feature will allow the unit to continue passing traffic, even in the unlikely event of a complete hardware failure. Here are some of the advantages of LAN failsafe as well as some considerations.

  • The LAN failsafe will allow the firewall to maintain communication with the primary link router, even if the unit experiences a hardware failure. Connectivity of the primary link is preserved.
  • Some security devices may refuse to update the MAC address they hold for the primary link router until a certain delay has elapsed. This delay is often configurable, and we recommend checking your firewall configuration for this parameter.
  • Incoming IP services that were load balanced require a DNS server to answer incoming requests with addresses from the primary link. Elfiq recommends that you always have two physical devices resolving DNS requests. This backup DNS service could be on a server in the DMZ. The same requirement does not apply when two Elfiq units are in failover, since the IDNS records reside on two hardware units.

A failover implementation using two physical devices is a much more reliable way to ensure the resilience of your network, because incoming services and the use of all links are preserved in case of a hardware failure. Note that the LAN failsafe bypass function can NOT be used in a failover/high availability scenario.

What is the best way to upgrade my Link LB units that are currently in failover to minimize downtime?

Upgrade the slave unit first. After the slave unit has rebooted, swap the roles between Master and Slave. Once the roles have been swapped, upgrade the new slave and reboot it. Here is a more detailed procedure:

On the slave unit:

  • Go to enable mode
  • Proceed to the “syst” module and issue the eosupdate command.
  • Fill in the requested information.
  • Reboot the slave unit. (reload command)
  • Once the unit has rebooted, log in and check the status.

On the master unit:

  • Ensure you are in the “fove” module by checking the prompt
  • Issue the “pause” command
  • The unit will go into paused mode and stop passing traffic; the other Link LB will automatically pick up the Master role (3-5 seconds of downtime).
  • Proceed to the “syst” module and issue the eosupdate command.
  • Fill in the requested information.
  • Reboot the paused unit. (reload command)

After the second Link LB has finished rebooting, it will go straight to SLAVE mode.
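
The order of operations above can be checked with a toy state model. The Python sketch below is purely illustrative: the Unit class, the role strings, and the "old"/"new" version labels are assumptions made for the example, and it models only the role changes described in this procedure, not actual EOS behavior.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    name: str
    role: str     # "MASTER", "SLAVE" or "PAUSED"
    version: str


def upgrade_and_reload(unit: Unit, new_version: str) -> None:
    """Model eosupdate followed by reload.

    Assumes the peer is MASTER at that moment, so the rebooted
    unit rejoins the pair as SLAVE.
    """
    unit.version = new_version
    unit.role = "SLAVE"


def pause_master(master: Unit, peer: Unit) -> None:
    """Model pause in the fove module: the master steps down, the peer takes over."""
    master.role = "PAUSED"
    peer.role = "MASTER"


def rolling_upgrade(master: Unit, slave: Unit, new_version: str) -> None:
    upgrade_and_reload(slave, new_version)   # 1. upgrade the slave first
    pause_master(master, slave)              # 2. swap roles (brief downtime)
    upgrade_and_reload(master, new_version)  # 3. upgrade the former master


if __name__ == "__main__":
    a = Unit("A", "MASTER", "old")
    b = Unit("B", "SLAVE", "old")
    rolling_upgrade(a, b, "new")
    print(a)
    print(b)
```

In this model the only moment without a forwarding master is the role swap in step 2, which matches the 3-5 seconds of downtime mentioned above.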

Rebooting the slave unit seems to have an effect on the traffic going through the switches. I double checked the configuration and everything seems ok. Is there something I can do?

If spanning tree is enabled on the switch, please ensure that you do one of the following:
1- Disable it for the VLANs or ports connected to the Link LB, for both the inside and the outside VLANs
2- Disable it altogether on the switch if possible
3- If you can’t disable it, have all the ports connected to the Link LB configured with PortFast.
The reasoning is as follows. When the unit is shut down and then restarted, all the corresponding switch ports start the spanning tree process. If the Link LB finishes booting before the spanning tree process is complete, the Link LB is isolated: it cannot see anything on its Ethernet interfaces and cannot communicate with the master Link LB. The newly booted Link LB therefore goes to master mode because it believes it is alone, and it assigns the virtual MAC addresses to its interfaces (these virtual MAC addresses are only assigned when the unit is MASTER). Once spanning tree finally completes, the ports are “opened” and both units claim the same MAC addresses. At this point there is a “collision” because both units are trying to use the same source MAC.
One unit will go into NEGOTIATING mode and back down to SLAVE, but traffic flows are disrupted during the short time when both units are MASTER.
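
The timing race described above can be expressed as a simple comparison. The Python sketch below is a rough model under stated assumptions: it uses the classic IEEE 802.1D default forward delay of 15 seconds (a listening phase plus a learning phase, about 30 seconds in total), treats PortFast as skipping that wait, and uses hypothetical function names and boot times. Your switch may use different timers or a rapid spanning tree variant.

```python
def port_forwarding_delay(portfast: bool, forward_delay: float = 15.0) -> float:
    """Seconds a switch port waits after link-up before forwarding traffic.

    Classic 802.1D spends one forward delay in listening and one in
    learning; PortFast is modelled here as skipping both phases.
    """
    return 0.0 if portfast else 2 * forward_delay


def risks_dual_master(link_up_to_ready: float, portfast: bool,
                      forward_delay: float = 15.0) -> bool:
    """True if the rebooted unit is ready before its ports forward.

    link_up_to_ready is a hypothetical measurement: seconds between the
    switch port link coming up and the Link LB looking for its peer.
    """
    return link_up_to_ready < port_forwarding_delay(portfast, forward_delay)


if __name__ == "__main__":
    # Hypothetical boot times, for illustration only.
    print(risks_dual_master(10, portfast=False))
    print(risks_dual_master(10, portfast=True))
    print(risks_dual_master(45, portfast=False))
```

Under these assumptions, a unit that is ready within the spanning tree wait period boots isolated and may claim the master role, which is why disabling spanning tree or enabling PortFast on the Link LB ports removes the race.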