Tuesday, May 21, 2013

Investigating frame errors in ifconfig output

Today I was alerted to network issues on one of our Linux servers. Fortunately, the Geneos monitoring tools analyze ifconfig output (which most other tools don't).
We found excessive frame errors in the ifconfig output and a lot of rx_crc_errors in the ethtool output.
What made things more interesting is that the same errors were observed on two different servers connected to the same switch, which gave us a clue that something might be wrong with the switch itself.

Layer 1 issues

Most people advise checking cables or hardware (NIC/switch), as rx_crc_errors indicate layer 1 issues. That might be the case if you have problems on one host only, but seeing the same issue on different hosts in the same subnet made the switch the prime suspect from the very beginning.

ifconfig and ethtool outputs

In the ifconfig output, the errors and frame counters were rising:

eth0      Link encap:Ethernet  HWaddr xx:xx:xx:xx:xx:xx  
          ...
          RX packets:277593775 errors:12013 dropped:0 overruns:0 frame:11763

I started monitoring this using:
# for i in `seq 1 100`; do ifconfig eth0 | grep frame; sleep 1; done
RX packets:277593775 errors:12128 dropped:0 overruns:0 frame:11877
RX packets:277593775 errors:12135 dropped:0 overruns:0 frame:11884
RX packets:277593775 errors:12143 dropped:0 overruns:0 frame:11892
(...)
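
The raw counter is hard to read at a glance, so a variation of the loop above can print only the number of new frame errors per sample. This is a minimal sketch, assuming the old net-tools ifconfig format shown above (with a frame:N field); the function names frame_count and watch_frame_errors are my own, not part of any tool.

```shell
# Extract the N from "frame:N" in old-style ifconfig output read from stdin.
frame_count() {
    sed -n 's/.*frame:\([0-9][0-9]*\).*/\1/p'
}

# Print the number of new frame errors seen in each one-second interval.
# Usage: watch_frame_errors eth0 100
watch_frame_errors() {
    iface=$1
    samples=$2
    prev=$(ifconfig "$iface" | frame_count)
    i=0
    while [ "$i" -lt "$samples" ]; do
        sleep 1
        cur=$(ifconfig "$iface" | frame_count)
        echo "new frame errors: $((cur - prev))"
        prev=$cur
        i=$((i + 1))
    done
}
```

With the readings above, going from frame:11877 to frame:11884 would be reported as 7 new frame errors.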

When checking ethtool -S eth0, rx_crc_errors were rising at the same rate.
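
The ethtool statistics can be filtered down to that single counter in the same way. A small sketch, assuming the driver exposes a statistic literally named rx_crc_errors (the names printed by ethtool -S are driver-specific); crc_errors is a helper name of my own.

```shell
# Print the value of rx_crc_errors from "ethtool -S" output read from stdin.
crc_errors() {
    awk -F': *' '$1 ~ /^[ \t]*rx_crc_errors$/ { print $2 }'
}

# Example: ethtool -S eth0 | crc_errors
```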

Verify NIC settings

Run ethtool eth0 to see the current speed and duplex:
# ethtool eth0
Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 100Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: umbg
        Wake-on: d
        Current message level: 0x00000007 (7)
        Link detected: yes
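
When several servers are involved, it helps to reduce this output to the three fields that matter for a negotiation problem. A sketch, assuming the ethtool output layout shown above; link_summary is a hypothetical helper name.

```shell
# Reduce "ethtool <iface>" output (stdin) to: speed, duplex, auto-negotiation.
link_summary() {
    awk -F': ' '
        $1 ~ /Speed$/                   { speed = $2 }
        $1 ~ /Duplex$/                  { duplex = $2 }
        $1 ~ /^[ \t]*Auto-negotiation$/ { autoneg = $2 }
        END { print speed, duplex, "autoneg=" autoneg }
    '
}

# Example: ethtool eth0 | link_summary
```

For the output above this prints 100Mb/s Full autoneg=on, which is easy to compare across hosts.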

You may also check dmesg to find out if there were any changes for eth0:
# dmesg | grep eth0
e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
8021q: adding VLAN 0 to HW filter on device eth0
e1000: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX

Check switch settings

Finally, I asked the network guys to verify the switch setup.
It turned out that they had recently replaced the old switch with a new one that had slightly different settings. All the ports on the new switch were set to auto:auto (speed & duplex).
They found that the switch had somehow negotiated 100Mbit half duplex instead of 100Mbit full duplex for all the servers' connections.

We fixed the issue by setting 100Mbit full duplex on all the required ports on the switch.
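
After a change like this, it is worth confirming from the server side that the counters have stopped moving. A minimal sketch that reads the kernel's per-interface statistics from sysfs; the function names and the STATS_ROOT override (there only to make the helper testable) are my own assumptions.

```shell
# Read one counter from sysfs. STATS_ROOT is a hypothetical override for testing.
read_counter() {
    cat "${STATS_ROOT:-/sys/class/net}/$1/statistics/$2"
}

# Exit 0 if the counter did not change during the given number of seconds.
# Usage: counter_is_stable eth0 rx_crc_errors 60
counter_is_stable() {
    before=$(read_counter "$1" "$2")
    sleep "$3"
    after=$(read_counter "$1" "$2")
    [ "$after" -eq "$before" ]
}
```

For example: counter_is_stable eth0 rx_crc_errors 60 && echo "no new CRC errors".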

How to reset ifconfig counters?

After this issue we had a lot of errors logged on the interfaces. Unfortunately, these counters can be reset in only two ways:
  1. reload the NIC driver module (modprobe -r module; modprobe module)
  2. reboot the box
If you're not sure which module to unload, check the ethtool -i eth0 output:
# ethtool -i eth0
driver: e1000
version: 7.3.21-k4-3-NAPI
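
The two steps can be combined so that the driver name is never typed by hand. A sketch only: driver_name and reload_command are hypothetical helper names, and the second function just prints the command instead of running it, because unloading the module takes down every interface that uses that driver (do it from the console, not over an SSH session on that NIC).

```shell
# Print the driver name from "ethtool -i" output read from stdin.
driver_name() {
    awk -F': ' '$1 == "driver" { print $2 }'
}

# Print (not run) the reload command for the interface's driver.
# Usage: reload_command eth0
reload_command() {
    drv=$(ethtool -i "$1" | driver_name)
    echo "modprobe -r $drv && modprobe $drv"
}
```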

