Developers who run WebLogic Server instances in a cluster are (or should be) pretty familiar with how the servers communicate with one another using multicast and sockets. IP multicast is a simple broadcast technology that lets multiple applications subscribe to a given IP address and port number and listen for messages. The IP/port combination is set up when the cluster is defined, and the server instances use multicast for JNDI updates and cluster heartbeats: each WebLogic server broadcasts regular heartbeat messages that advertise its availability in the cluster. If you have a cluster on a network segment where multicast isn’t working, you get weird problems.
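To see why lost multicast traffic turns into "weird problems", here is a rough sketch of the heartbeat idea in plain Java. This is not WebLogic's code: the class name HeartbeatTracker, its methods, and the 30-second timeout are all made up for illustration. The point is only that a server which stops hearing a peer's heartbeats will eventually treat that peer as gone, even if the peer is perfectly healthy.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of heartbeat-based membership; not WebLogic's implementation.
public class HeartbeatTracker {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    public HeartbeatTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Call whenever a heartbeat from serverName arrives on the multicast group.
    public void onHeartbeat(String serverName, long nowMillis) {
        lastSeen.put(serverName, nowMillis);
    }

    // Servers whose most recent heartbeat falls within the timeout window.
    public Set<String> aliveAt(long nowMillis) {
        Set<String> alive = new TreeSet<>();
        for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
            if (nowMillis - entry.getValue() <= timeoutMillis) {
                alive.add(entry.getKey());
            }
        }
        return alive;
    }

    public static void main(String[] args) {
        HeartbeatTracker tracker = new HeartbeatTracker(30_000); // assumed timeout
        tracker.onHeartbeat("server1", 0);
        tracker.onHeartbeat("server2", 0);
        // From here on, only server1's heartbeats get through.
        tracker.onHeartbeat("server1", 10_000);
        tracker.onHeartbeat("server1", 20_000);
        tracker.onHeartbeat("server1", 30_000);
        System.out.println(tracker.aliveAt(30_000)); // [server1, server2]
        System.out.println(tracker.aliveAt(35_000)); // [server1]
    }
}
```

In the run above, server2 never crashed; its heartbeats simply stopped arriving, which is exactly what a router that drops multicast looks like from the other side.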
I recently ran into an issue that took days to fix because MulticastTest, the utility WebLogic provides for debugging multicast problems, is broken. As part of a datacenter move, we were moving 2 servers running Linux and WebLogic. In the past, the 2 servers sat in 2 different datacenters but were part of the same VLAN, which essentially simulated a single subnet: both servers acted as if they were on the same network segment even though they were geographically separated. As part of the move, the VLAN that connected these servers was removed, and the routers were configured to route multicast traffic so that the WebLogic server instances on these 2 servers could still see each other, cluster together, offer failover, etc. Once the servers were moved and disconnected from the VLAN, weird things started to happen: application hangs, stuck threads, etc. Suspecting a network issue, I fired up the MulticastTest utility on both sides to see if multicast was working. The syntax is pretty straightforward:
$ java utils.MulticastTest -n server1 -a 224.x.x.x -p 9001
Once you start this on server1, you go to server2 and fire up the same utility with server2 as the name and the same multicast IP/port combination. If multicast is working correctly and the routers aren’t dropping it, server1 should see the broadcasts from server2 and vice versa. We didn’t see that, so our network guys spent time figuring out why our routers weren’t routing that traffic. After some configuration changes and a switch to a new IP range that wasn’t using the obsolete RIP range, I got the all-clear to try again. I fired up the MulticastTest utility again and server1 still couldn’t see server2. The network guys looked again and still found no issues, and we could not figure out why the test utility was not working. After spending hours on that, we decided to just start the WebLogic server instances on server1 and server2, and guess what: multicast was working. The WebLogic server instances on server1 saw the ones on server2 and vice versa. WTF. So I tried the MulticastTest utility once more and it still didn’t work. I had assumed that the MulticastTest utility would use the same codebase as the WebLogic server itself, but I guess it doesn’t, because the utility is broken. Another issue is that the MulticastTest utility does not let you specify the multicast TTL (time-to-live), and that may be what breaks it on a WAN. I’ve submitted this as a bug to support and hope they fix it in the next service pack. (The version of WebLogic involved here is WebLogic 8.1 SP4.)
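If the TTL really is the culprit, a small standalone probe that lets you set it is easy to put together with the standard java.net.MulticastSocket class. The sketch below is my own, not a drop-in replacement for MulticastTest and not WebLogic's wire format; the class name McastTtlProbe, the argument order, and the 2-second interval are all assumptions. Every router hop decrements the TTL, and Java's default multicast TTL is 1, so packets sent with the default never leave the local subnet.

```java
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

// Hypothetical probe: sends a numbered message every 2 seconds and prints what it hears.
public class McastTtlProbe {

    // Validates a TTL given on the command line; a multicast TTL must be 0..255.
    static int parseTtl(String value) {
        int ttl = Integer.parseInt(value);
        if (ttl < 0 || ttl > 255) {
            throw new IllegalArgumentException("TTL must be between 0 and 255: " + ttl);
        }
        return ttl;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 4) {
            System.err.println("usage: java McastTtlProbe <name> <multicastAddress> <port> <ttl>");
            return;
        }
        String name = args[0];
        InetAddress group = InetAddress.getByName(args[1]);
        if (!group.isMulticastAddress()) {
            throw new IllegalArgumentException(args[1] + " is not a multicast address");
        }
        int port = Integer.parseInt(args[2]);
        int ttl = parseTtl(args[3]);

        try (MulticastSocket socket = new MulticastSocket(port)) {
            socket.setTimeToLive(ttl); // the knob MulticastTest does not expose
            socket.joinGroup(group);   // subscribe to the address/port combination
            byte[] buf = new byte[256];
            for (int seq = 1; seq <= 10; seq++) {
                byte[] out = (name + " " + seq).getBytes(StandardCharsets.UTF_8);
                socket.send(new DatagramPacket(out, out.length, group, port));

                // Listen for up to 2 seconds before sending the next message.
                long deadline = System.currentTimeMillis() + 2000;
                long remaining;
                while ((remaining = deadline - System.currentTimeMillis()) > 0) {
                    socket.setSoTimeout((int) remaining);
                    try {
                        DatagramPacket in = new DatagramPacket(buf, buf.length);
                        socket.receive(in);
                        String msg = new String(in.getData(), 0, in.getLength(),
                                StandardCharsets.UTF_8);
                        System.out.println("received \"" + msg + "\" from " + in.getAddress());
                    } catch (SocketTimeoutException e) {
                        break; // interval elapsed with nothing more to read
                    }
                }
            }
        }
    }
}
```

You would run it on both machines with the same address and port but different names, with the TTL set to at least the number of router hops between them. Expect to see your own messages echoed back as well, since multicast loopback is normally enabled; what matters is whether the other side's name ever shows up.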