BEGIN TL-45 #1 <-> TL-45 #2 <-> TL-45 #3 <-> TL-45 #4 END
#1 drops off the radar: you can't ping it or telnet to it, but you can still ping or telnet to #2, #3, and #4 while #1 is "down". To be clear, the only path to #2, #3, or #4 is through #1. Also, it isn't just #1; all four exhibit the same behavior.
#4 is currently waiting for its sibling #5 to come on-line, which is waiting on its sibling #6 to come on-line, which makes #4 the end of the line as of now. So I am trying to figure out why this tends to happen at regular/random intervals.
5.8 GHz - RF power = 17, RSSI = mid to low 60s. The only traffic across these radios at the moment is from monitoring agents.
Comments
It sounds like the management interface just stops responding. What triggers it to come back, or does it start responding again on its own without any intervention?
If they come and go on their own, approximately how long is the unit up before you see the problem and how long is it down before it comes back?
Is there a time that all of them are reachable or is one of them always down? I could see this occurring if there was a duplicate MAC address.
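One way to answer the "how long up, how long down" questions above is to pull outage windows out of the monitor's poll history. The following is a minimal sketch, assuming the history can be exported as time-ordered (timestamp, reachable) pairs; the function name `outage_windows` and the sample log are made up for illustration and do not come from any of the monitoring tools.

```python
from datetime import datetime, timedelta

def outage_windows(samples):
    """Return (start, end, duration) for each run of failed polls.

    `samples` is a time-ordered list of (timestamp, reachable) pairs.
    An outage starts at the first failed poll and ends at the next
    successful one; an outage still open at the end of the log is skipped.
    """
    windows = []
    down_since = None
    for ts, ok in samples:
        if not ok and down_since is None:
            down_since = ts                      # first failed poll of a run
        elif ok and down_since is not None:
            windows.append((down_since, ts, ts - down_since))
            down_since = None                    # device answered again
    return windows

# Made-up example: one-minute polls with three consecutive failures.
t0 = datetime(2024, 1, 1, 12, 0)
log = [(t0 + timedelta(minutes=i), i not in (4, 5, 6)) for i in range(10)]

for start, end, length in outage_windows(log):
    print(start.time(), "->", end.time(), length)
```

With one-minute polling the measured duration is only accurate to about a minute on each side, so a "3 minute" outage could really be anywhere from roughly two to four minutes.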
To add insult to injury, I ran a link test between the two end devices.
Link Test Results:
Duration: 10 min(s)
Direction   Tx Rate   Tx Bytes    Error Rate
MU to RU    54        494400000   0.00
RU to MU    54        494400000   0.00
Aggregated Bandwidth: 42.70 Mbps
Radio #1 --> #2
Test Results:
Duration: 10 min(s)
Direction   Tx Rate   Tx Bytes    Error Rate
MU to RU    54        468800000   1.00
RU to MU    54        468800000   13.00
Aggregated Bandwidth: 34.45 Mbps
This happens with all our TL-45 links, not just the one described above. I picked it because it isn't currently carrying traffic, which lets me test and troubleshoot it, whereas the others are live. Does the broadcast domain affect the TL-45? They are all in one domain; layer 2 works fine, but layer 3 is hit and miss.
Thanks.
Bill, it is strange that I don't recall hearing reports of this from other users, but maybe they aren't monitoring it as closely as you are.
I monitor a TL45 link on our production network and don't see the same issue.
The TL45 would be impacted to a minor extent by broadcast packets, since it has to process them, but it would take a fairly significant amount of traffic before that became noticeable. Even then, it should only stop responding while the traffic load is high, not for almost exactly 3 minutes like you're seeing.
Out of curiosity, are you using the standard ping or the multiple ping request type? I doubt it would make any difference, but it would rule out a single dropped ping being recorded as an outage.
Attachments: 1hr.png, 2days.png, 30days.png, 365days.png
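The idea behind a "multiple ping" check, not flagging a device on one lost reply, can be sketched as a simple consecutive-failure threshold. This is an illustration only: the function `debounce_status` and the threshold of 3 are assumptions, not how any of the monitoring products mentioned in this thread actually implement it.

```python
def debounce_status(results, fail_threshold=3):
    """Return 'up'/'down' for each probe result.

    A device is reported 'down' only once `fail_threshold` probes in a
    row have failed; any successful probe resets the count to zero.
    """
    status = []
    consecutive_failures = 0
    for ok in results:
        consecutive_failures = 0 if ok else consecutive_failures + 1
        status.append("down" if consecutive_failures >= fail_threshold else "up")
    return status

# Made-up probe history: one isolated loss, then three losses in a row.
probes = [True, False, True, False, False, False, True]
print(debounce_status(probes))
# ['up', 'up', 'up', 'up', 'up', 'down', 'up']
```

The trade-off is detection delay: with one probe per minute and a threshold of 3, a real outage is only reported after about three minutes, which matters if a dropped link has to be noticed quickly.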
Well, that's a good question ....
We have a program called Pathview. I have not looked at it much, as it was here before me, but I believe it pings every minute, uses the return times to estimate path quality, and keeps a history.
We also use another monitor that pings every minute and is displayed in our common area in map format to give a quick and dirty live view, with limited history.
We also have our recently added PRTG demo, which pulls SNMP info only, to give a deeper look into the device data; it pings only those devices that do not have SNMP support.
So currently we have three monitoring systems, trying to determine network health. PRTG will replace one if not both of the other systems. At this point I have checked software revision, and other settings to see if something stands out. We need to know relatively quickly if a link drops hence the polling time frame.
All three monitors are in relative sync in recording these events. At this point I am confused. I am monitoring the endpoints as well, and they do not "drop" but stay stable, which only confirms what we already know. The radios are passing traffic and performing well in my opinion, but accessing the radios by IP is erratic. The data rates are well within the radios' capabilities, but the false positives on the monitors are a serious problem for us.
Thanks,
bill
Consider this closed.
Oddly, while testing the PRTG monitor, I suspected that one of our trunks was performing poorly. I logged into each trunk radio, and although RSSI values looked good at both ends, ARQ resends were high on one end. I slowed the trunk speed from 36 Mbps to 24 Mbps without any other changes, and the problem seems to have stopped on our entire network.
This may have something to do with our topology, but is still a mystery to me that one link would be this disruptive. Thank you for your support, now I need to get the monitoring fine tuned to see what is really going on. Keep your fingers crossed, I am, I'm currently typing this with one hand.
bill
Bill, have things been solid for the past month since the change?
Yes it has.
The network as a whole has become more predictable. The monitoring is stable and near completion, except for reporting.
I still don't understand how the modification could make such a dramatic change. The only thing I can think of is that the bad trunk was thrashing the core switch. For multiple trunks to have "issues" and then clear with one change to one path is odd. We have been running clean, in the 99.98% uptime range, for a while now.
Our largest customer called our management independently, confirming this by stating they are observing the same uptimes with their gear.
bill
PS: I still haven't solved why the Giga radios aren't passing LLDP.
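To put the 99.98% uptime figure in perspective, the allowed downtime can be worked out directly. A small sketch follows; the helper name `downtime_minutes` and the 30-day and 365-day windows are chosen here for illustration and are not taken from the monitoring reports.

```python
def downtime_minutes(uptime_percent, days):
    """Minutes of downtime implied by an uptime percentage over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100

print(round(downtime_minutes(99.98, 30), 2))   # about 8.64 minutes per 30 days
print(round(downtime_minutes(99.98, 365), 2))  # about 105.12 minutes per year
```

At roughly three minutes per dropout, that monthly budget leaves room for fewer than three such events in 30 days.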