No Management, but Passing Traffic

Comments

11 comments

  • Scott Chester

    It sounds like the management just stops responding. What triggers it to come back, or does it start responding again on its own without any intervention?

    If they come and go on their own, approximately how long is the unit up before you see the problem and how long is it down before it comes back?

    Is there a time that all of them are reachable or is one of them always down? I could see this occurring if there was a duplicate MAC address.

  • Bill Allen
    A1: It starts responding on its own without intervention. While "down" it also will not respond to SNMP requests. It is as if the Ethernet interface locks up, then clears.

    Q2: If they come and go on their own, approximately how long is the unit up before you see the problem and how long is it down before it comes back?

    A2: The time between failures seems random, but if I ping across the link to the distant radio, it appears to come back up quicker. Emphasis on "appears". The time between failures is much longer than the time to recover: anywhere from 5-6 minutes to a couple of hours on each device. The constant is the time they take to recover: 2 minutes 50 seconds to 2 minutes 52 seconds across all devices. The variation is from the polling, I'm sure.

    Q3: Is there a time that all of them are reachable or is one of them always down? I could see this occurring if there was a duplicate MAC address.

    A3: They are all up for a time, and then one or all might go down, or any variation of one or more. The next time I can catch one down I will do a MAC scan to see if I am having any conflicts between IPs or MACs.
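The duplicate-address check discussed above can be sketched in a few lines. This is only an illustration, assuming the scan results are already available as (IP, MAC) pairs; the function name `find_conflicts` and the input format are hypothetical and do not come from any particular scanner.

```python
from collections import defaultdict

def find_conflicts(entries):
    """Find IPs claimed by several MACs and MACs seen behind several IPs.

    entries: iterable of (ip, mac) string pairs from a scan
             (hypothetical input format, not tied to a specific tool).
    """
    macs_by_ip = defaultdict(set)
    ips_by_mac = defaultdict(set)
    for ip, mac in entries:
        # Normalize so "AA-BB-..." and "aa:bb:..." compare equal.
        mac = mac.lower().replace("-", ":")
        macs_by_ip[ip].add(mac)
        ips_by_mac[mac].add(ip)
    # One IP answered by more than one MAC suggests an IP conflict.
    ip_conflicts = {ip: sorted(m) for ip, m in macs_by_ip.items() if len(m) > 1}
    # One MAC behind several IPs may be a duplicate MAC, but can also be
    # legitimate (for example a router or a multi-homed host).
    mac_conflicts = {mac: sorted(i) for mac, i in ips_by_mac.items() if len(i) > 1}
    return ip_conflicts, mac_conflicts
```

Running the scan several times while a radio is unreachable, as planned above, and feeding all results into one call makes an intermittent conflict more likely to show up.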
  • Bill Allen
    I was able to run a couple of scans while experiencing difficulties. I scanned the entire subnet 12 times for IP/MAC conflicts and did not find any. I then scanned just the link 7 times while experiencing difficulties a second time, with the same result: no conflicts.

    Also, the following are ping latencies across the link to the far side during normal conditions and then with one radio down in the middle.

    All radios responding:
    222 packets transmitted, 222 received, 0% packet loss, time 221307ms
    rtt min/avg/max/mdev = 6.999/10.427/42.739/4.541 ms

    One radio in the middle not responding:
    72 packets transmitted, 71 received, 1% packet loss, time 71102ms
    rtt min/avg/max/mdev = 7.296/10.204/30.212/4.153 ms
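To compare the two runs above on equal terms, the loss figure can be recomputed from the packet counts in each summary line. A minimal sketch, assuming the summary line has the "N packets transmitted, M received" shape quoted above; the helper name `loss_percent` is made up for illustration.

```python
import re

# Matches the packet counts in a summary line such as
# "72 packets transmitted, 71 received, 1% packet loss, time 71102ms".
SUMMARY = re.compile(r"(\d+) packets transmitted, (\d+) received")

def loss_percent(summary_line):
    """Return packet loss as a percentage computed from the raw counts."""
    match = SUMMARY.search(summary_line)
    if match is None:
        raise ValueError("not a ping summary line")
    sent, received = int(match.group(1)), int(match.group(2))
    if sent == 0:
        raise ValueError("no packets were transmitted")
    return 100.0 * (sent - received) / sent
```

For the second run this gives about 1.39% (one lost packet out of 72), against 0% for the first. Both runs carried traffic end to end, which fits the observation that the link keeps passing traffic while one radio's management address is unreachable.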
  • Bill Allen

    To add insult to injury, I ran a link test between the two end devices.

    Link Test Results:

    Duration: 10 min(s)
                Tx Rate    Tx Bytes     Error Rate
    MU to RU    54         494400000    0.00
    RU to MU    54         494400000    0.00

    Aggregated Bandwidth: 42.70 Mbps

  • Bill Allen

    Radio #1 --> #2

    Test Results:

    Duration: 10 min(s)
                Tx Rate    Tx Bytes     Error Rate
    MU to RU    54         468800000    1.00
    RU to MU    54         468800000    13.00

    Aggregated Bandwidth: 34.45 Mbps

  • Bill Allen

    This happens with all our TL-45 links, not just the one described above. I picked that one because it isn't currently carrying traffic, which lets me test and troubleshoot it, whereas the others are live. Does the broadcast domain affect the TL-45? I ask because they are all in one domain. Layer 2 works fine; layer 3 is hit and miss.

    Thanks.

  • Scott Chester

    Bill, it is strange that I don't recall hearing reports of this from other users, but maybe they aren't monitoring it as closely as you are.

    I monitor a TL45 link on our production network and don't see the same issue.

    The TL45 would be affected to a minor extent by broadcast packets, since it has to process them, but it would take a fairly significant amount of traffic before that became noticeable. Even then, it should only stop responding while the traffic load is high, not for almost exactly 3 minutes as you're seeing.

    Out of curiosity, are you using the standard ping or the multiple ping request type? I doubt it would make any difference, but it would eliminate the chance of a single dropped ping being recorded.
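The idea of not recording a single dropped ping can be sketched independently of any monitoring product: a device is only declared down after several consecutive missed polls. The function name `down_events` and the default threshold are assumptions for illustration, not settings taken from PRTG or any other tool.

```python
def down_events(results, threshold=3):
    """Return the poll indices at which a device is declared down.

    results:   sequence of booleans, True meaning the poll was answered.
    threshold: number of consecutive missed polls required before the
               device counts as down (hypothetical default).
    """
    events = []
    misses = 0
    for index, answered in enumerate(results):
        if answered:
            misses = 0  # any answer resets the streak
        else:
            misses += 1
            if misses == threshold:
                # Report once, at the poll where the streak reaches the threshold.
                events.append(index)
    return events
```

A threshold like this trades false positives against detection delay. With one-minute polls, an outage of roughly 2 minutes 50 seconds spans only two or three polls, so a threshold above 2 could hide it entirely.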




    Attached graphs: 1hr.png, 2days.png, 30days.png, 365days.png
  • Bill Allen

    Well, that's a good question ....

    We have a program called Pathview. I have not looked at it much, as it was here before me, but I believe it pings every minute, uses the return times to estimate path quality, and keeps a history.

    We also use another monitor that pings every minute and is displayed in our common area in map format to give a quick and dirty live view, with limited history.

    We also recently added a PRTG demo, which pulls SNMP info only, to give a deeper look into the device data, and only pings devices that do not have SNMP support.

    So currently we have three monitoring systems, trying to determine network health. PRTG will replace one if not both of the other systems. At this point I have checked software revision, and other settings to see if something stands out. We need to know relatively quickly if a link drops hence the polling time frame.

    All three monitors are in relative sync in recording these events. At this point I am confused. I am also monitoring the endpoints, which do not "drop" but are stable, which only confirms what we already know. The radios are passing traffic and performing well in my opinion, but accessing the radios by IP is erratic. The data rates are well within the radios' capabilities, but the false positives on the monitors are a serious problem for us.

    Thanks,

    bill

  • Bill Allen

    Consider this closed.

    Oddly, while testing the PRTG monitor, I suspected that one of our trunks was performing poorly. I logged into each trunk radio, and though RSSI values looked good at both ends, ARQ resends were high on one end. I slowed the trunk speed from 36 Mbps to 24 Mbps, without any other changes, and the problem seems to have stopped on our entire network.

    This may have something to do with our topology, but it is still a mystery to me that one link could be this disruptive. Thank you for your support. Now I need to get the monitoring fine-tuned to see what is really going on. Keep your fingers crossed; I am, and I'm currently typing this with one hand.

    bill

  • Scott Chester

    Bill, have things been solid for the past month since the change?

  • Bill Allen

    Yes it has.

    The network as a whole has become more predictable. The monitoring is stable and near completion, except for reporting.

    I still don't understand how the modification could make such a dramatic change. The only thing I can think of is that it was trashing the core switch. For multiple trunks to have "issues" and then clear with one change to one path is odd. We have been running clean, in the 99.98% uptime range, for a while now.

    Our largest customer called our management independently, confirming this by stating they are observing the same uptimes with their gear.
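To put the 99.98% figure in perspective, an uptime percentage can be converted into the downtime it allows over a period. This is plain arithmetic, not a formula from any monitoring tool; the function name `allowed_downtime_minutes` is made up for illustration.

```python
def allowed_downtime_minutes(uptime_percent, period_days):
    """Minutes of downtime permitted by an uptime percentage over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0
```

At 99.98% this works out to about 8.64 minutes over 30 days, or about 105 minutes over 365 days. Each management outage described earlier in the thread lasted roughly 2 minutes 50 seconds, so about three of them in a 30-day window would already use up that budget.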

    bill

    PS: I still haven't solved why the Giga radios aren't passing LLDP.
