No Management, but Passing Traffic

Updated August 20, 2018 06:00

BEGIN TL-45 #1 <-> TL-45 #2 <-> TL-45 #3 <-> TL-45 #4 END

#1 drops off the radar: you can't ping it or telnet to it, but you can still ping or telnet to #2, #3, and #4 while #1 is "down". To be clear, the only path to #2, #3, or #4 is through #1. Also, it isn't just #1; all four exhibit the same behavior.

#4 is currently waiting for its sibling #5 to come on-line, which is waiting on its sibling #6 to come on-line, which makes #4 the end of the line as of now. So I am trying to figure out why this tends to happen at regular/random intervals.

5.8 GHz - Rf power = 17, RSSI = mid to low 60's. The only traffic across these radios is from monitoring agents at the moment.

Comments

Scott Chester March 20, 2013 15:14

It sounds like the management interface just stops responding. What triggers it to come back, or does it start responding again on its own without any intervention?

If they come and go on their own, approximately how long is the unit up before you see the problem and how long is it down before it comes back?

Is there a time that all of them are reachable or is one of them always down? I could see this occurring if there was a duplicate MAC address.

Bill Allen March 20, 2013 15:53

A1: It starts responding on its own without intervention. Also, while "down" it will not respond to SNMP requests. It is as if the Ethernet interface stops/locks, then clears.

Q2: If they come and go on their own, approximately how long is the unit up before you see the problem and how long is it down before it comes back?

A2: The length of time seems random, but if I ping across the link to the distant radio, it appears to come back up quicker. Emphasis on "appears". The time between failures is much longer than the time to recover. Time between failures can be anywhere from 5-6 minutes to a couple of hours on each device. But one constant is the time they take to recover: 2 minutes 50 seconds to 2 minutes 52 seconds across all devices. The variation comes from the polling, I'm sure.

Q3: Is there a time that all of them are reachable or is one of them always down? I could see this occurring if there was a duplicate MAC address.

A3: They are all up for a time, and then one or all might go down, or any variation of one or more. The next time I can catch one down, I will do a MAC scan to see if I am having any conflicts between IPs or MACs.
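The recovery time quoted above can be measured from any timestamped poll log. The following is a minimal sketch, assuming the log is available as a list of (timestamp, ok) pairs; the function name `outage_windows` and the sample data are hypothetical and are not taken from any of the monitoring tools in this thread.

```python
from datetime import datetime, timedelta


def outage_windows(samples):
    """Return (start, end, duration) for each run of failed polls.

    samples: list of (timestamp, ok) tuples sorted by time, where ok is
    True when the radio answered the poll.
    """
    windows = []
    down_since = None
    for ts, ok in samples:
        if not ok and down_since is None:
            # First failed poll of a new outage.
            down_since = ts
        elif ok and down_since is not None:
            # First successful poll after an outage closes the window.
            windows.append((down_since, ts, ts - down_since))
            down_since = None
    return windows


# Hypothetical poll log: one poll every 10 seconds, with polls 30..46 failing.
t0 = datetime(2013, 3, 20, 15, 0, 0)
samples = [(t0 + timedelta(seconds=10 * i), not (30 <= i < 47)) for i in range(60)]

for start, end, duration in outage_windows(samples):
    print(start.time(), end.time(), duration)
```

The measured duration runs from the first failed poll to the first successful one, so its resolution is limited by the polling interval. With one-minute polls, a spread of a couple of seconds like the one reported above could not be resolved.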

Bill Allen March 20, 2013 17:07

I was able to run a couple of scans while experiencing difficulties. I scanned the entire subnet 12 times for IP/MAC conflicts and did not find any. I then scanned just the link 7 times while experiencing difficulties a second time, with the same result: no conflicts. Also, the following are ping latencies across the link to the far side during normal conditions and then with one radio down in the middle.

All radios responding:
222 packets transmitted, 222 received, 0% packet loss, time 221307ms
rtt min/avg/max/mdev = 6.999/10.427/42.739/4.541 ms

One radio in the middle not responding:
72 packets transmitted, 71 received, 1% packet loss, time 71102ms
rtt min/avg/max/mdev = 7.296/10.204/30.212/4.153 ms
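To compare the two ping runs side by side, the summaries can be parsed into numbers. This is a rough sketch, assuming the Linux-style ping summary format shown in this post; the helper name `parse_ping` is made up for illustration.

```python
import re

# Patterns for the two summary lines printed at the end of a ping run.
SUMMARY = re.compile(
    r"(\d+) packets transmitted, (\d+) received, "
    r"(\d+(?:\.\d+)?)% packet loss, time (\d+)ms"
)
RTT = re.compile(r"rtt min/avg/max/mdev = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms")


def parse_ping(text):
    """Extract packet counts and round-trip times from a ping summary."""
    s = SUMMARY.search(text)
    r = RTT.search(text)
    if s is None or r is None:
        raise ValueError("not a recognizable ping summary")
    sent, received = int(s.group(1)), int(s.group(2))
    return {
        "sent": sent,
        "received": received,
        "lost": sent - received,
        "avg_ms": float(r.group(2)),
        "max_ms": float(r.group(3)),
    }


# The two summaries quoted in this post.
normal = (
    "222 packets transmitted, 222 received, 0% packet loss, time 221307ms "
    "rtt min/avg/max/mdev = 6.999/10.427/42.739/4.541 ms"
)
degraded = (
    "72 packets transmitted, 71 received, 1% packet loss, time 71102ms "
    "rtt min/avg/max/mdev = 7.296/10.204/30.212/4.153 ms"
)

for label, text in (("all responding", normal), ("one not responding", degraded)):
    stats = parse_ping(text)
    print(label, stats["lost"], stats["avg_ms"], stats["max_ms"])
```

On these two samples the average round-trip time is about the same in both runs and only one packet is lost, which is consistent with the radios forwarding traffic normally while one of them is unreachable for management.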

Bill Allen March 20, 2013 21:40

To add insult to injury, I ran a link test between the two end devices.

Link Test Results:

Duration: 10 min(s)
           Tx Rate   Tx Bytes    Error Rate
MU to RU   54        494400000   0.00
RU to MU   54        494400000   0.00

Aggregated Bandwidth: 42.70 Mbps

Bill Allen March 20, 2013 21:54

Radio #1 --> #2

Test Results:

Duration: 10 min(s)
           Tx Rate   Tx Bytes    Error Rate
MU to RU   54        468800000   1.00
RU to MU   54        468800000   13.00

Aggregated Bandwidth: 34.45 Mbps

Bill Allen March 25, 2013 20:11

This happens with all our TL-45 links, not just the one described above. I picked it because it isn't currently carrying traffic, which lets me test and troubleshoot it, whereas the others are live. Does the broadcast domain affect the TL-45? They are all in one domain; layer 2 works fine, but layer 3 is hit and miss.

Thanks.

Scott Chester March 25, 2013 21:41

Bill, it is strange that I don't recall hearing reports of this from other users, but maybe they aren't monitoring it as closely as you are.

I monitor a TL45 link on our production network and don't see the same issue.

The TL45 would be affected to a minor extent by broadcast packets, since it has to process them, but it would take a fairly significant amount of traffic before that became noticeable. Even then, it should only stop responding while the traffic load is high, not for almost exactly 3 minutes as you're seeing.

Out of curiosity, are you using the standard ping or the multiple ping request type? I doubt it would make any difference, but it would eliminate the chance of a single dropped ping being recorded as an outage.
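The idea behind a multiple-request check can be sketched as follows. This is only a generic illustration of the technique, not the actual implementation of any monitoring product; `check_host`, `probe`, and `attempts` are hypothetical names, and the probe is simulated rather than sending real ICMP packets.

```python
def check_host(probe, attempts=4):
    """Report 'up' if any of `attempts` probes succeeds, otherwise 'down'.

    probe: a zero-argument callable returning True when a reply arrived.
    """
    for _ in range(attempts):
        if probe():
            return "up"
    return "down"


# Simulated probe: the first request is dropped, the next one is answered.
replies = iter([False, True, True, True])
print(check_host(lambda: next(replies)))

# A radio that never answers is still reported as down.
print(check_host(lambda: False))
```

A single lost packet no longer registers as an outage, but a unit that stays silent for close to three minutes would fail every attempt and still be reported as down.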

[Attached graphs: 1hr.png, 2days.png, 30days.png, 365days.png]

Bill Allen March 26, 2013 15:19

Well, that's a good question...

We have a program called Pathview. I have not looked at it much, as it was here before me, but I believe it pings every minute, uses the return times to estimate path quality, and keeps a history.

We also use another monitor that pings every minute and is displayed in our common area in map format to give a quick and dirty live view, with limited history.

We also recently added a PRTG demo, which pulls SNMP info only to give a deeper look into the device data, and only pings devices that do not have SNMP support.

So currently we have three monitoring systems trying to determine network health. PRTG will replace one, if not both, of the other systems. At this point I have checked the software revision and other settings to see if something stands out. We need to know relatively quickly if a link drops, hence the polling time frame.

All three monitors are in relative sync in recording these events. At this point I am confused. I am monitoring the endpoints as well, and they do not "drop" but are stable, which only confirms what we already know. The radios are passing traffic and performing well in my opinion, but accessing the radios by IP is erratic. The data rates are well within the radios' capabilities, but the false positives on the monitors are a serious problem for us.

Thanks,

bill

Bill Allen April 08, 2013 18:09

Consider this closed.

Oddly, while testing the PRTG monitor, I suspected that one of our trunks was performing poorly. I logged into each trunk radio, and though RSSI values looked good at both ends, ARQ resends were high on one end. I slowed the trunk speed from 36 Mbps to 24 Mbps, without any other changes, and the problem seems to have stopped on our entire network.

This may have something to do with our topology, but it is still a mystery to me that one link would be this disruptive. Thank you for your support. Now I need to get the monitoring fine-tuned to see what is really going on. Keep your fingers crossed. I am; I'm currently typing this with one hand.

bill

Scott Chester May 13, 2013 14:31

Bill, have things been solid for the past month since the change?

Bill Allen May 13, 2013 18:51

Yes it has.

The network as a whole has become more predictable. The monitoring is stable and near completion, except for reporting.

I still don't understand how the modification could make such a dramatic change. The only thing I can think of is that it was trashing the core switch. For multiple trunks to have "issues" that then clear with one change to one path is odd. We have been running clean, in the 99.98% uptime range, for a while now.

Our largest customer called our management independently, confirming this by stating they are observing the same uptimes with their gear.
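For scale, an uptime percentage can be converted into an allowed amount of downtime. The snippet below is a small worked calculation, assuming a 30-day window chosen purely for illustration; `downtime_minutes` is a hypothetical helper.

```python
def downtime_minutes(uptime_percent, days):
    """Minutes of downtime implied by an uptime percentage over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100


# 99.98% uptime over an assumed 30-day window.
print(round(downtime_minutes(99.98, 30), 2))  # 8.64
```

At 99.98%, a 30-day window allows roughly 8.6 minutes of downtime, which is about three of the 2 minute 50 second dropouts described earlier in the thread.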

bill

PS: I still haven't solved why the Giga radios aren't passing LLDP.
