Communication Channel Health Check showing down / down


In NSX 6.2 you can run a "Communication Channel Health Check" from the NSX Manager to check whether a host's Control Plane Agent and Firewall Agent connections are healthy, up and running.

In my lab environment I ran into a problem where both the Control Plane Agent and the Firewall Agent connections were down.

Because of this I was also not able to push any firewall rules to that host.
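The same two connections can also be looked at from the host itself with `esxcli network ip connection list`. The sketch below is a hypothetical helper, `check_nsx_channels`, that reads that output on stdin and reports whether each agent has an ESTABLISHED TCP session. It assumes the usual NSX-v ports, TCP 1234 for netcpa to the NSX Controllers and TCP 5671 for vsfwd to the NSX Manager; verify those against your own environment before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: reports whether the NSX host agents have established
# TCP sessions. Assumed ports (verify for your NSX version):
#   1234 = netcpa -> NSX Controllers (Control Plane Agent)
#   5671 = vsfwd  -> NSX Manager (Firewall Agent)
# Reads "esxcli network ip connection list" output on stdin.
# Returns 0 only if both connections are ESTABLISHED.
check_nsx_channels() {
    out=$(cat)
    rc=0
    for entry in "1234:netcpa (control plane)" "5671:vsfwd (firewall)"; do
        port=${entry%%:*}   # text before the first colon
        name=${entry#*:}    # text after the first colon
        # Field 5 is the foreign address, field 6 is the TCP state.
        if printf '%s\n' "$out" | awk -v port="$port" \
            '$1 ~ /^tcp/ && $5 ~ (":" port "$") && $6 == "ESTABLISHED" { found = 1 }
             END { exit !found }'; then
            echo "$name: up"
        else
            echo "$name: down"
            rc=1
        fi
    done
    return $rc
}

# On the ESXi host:
#   esxcli network ip connection list | check_nsx_channels
```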

So I started googling and came across this post, which suggested that the services that should be running might be down:

/etc/init.d/netcpad 
/etc/init.d/vShield-Stateful-Firewall

So I verified the status of the services and stopped / started them again.

[root@dc1-pod11-esx-a-03:~] /etc/init.d/vShield-Stateful-Firewall status
vShield-Stateful-Firewall is running

[root@dc1-pod11-esx-a-03:~] /etc/init.d/netcpad status
netCP agent service is running

[root@dc1-pod11-esx-a-03:~] /etc/init.d/netcpad stop
watchdog-netcpa: Terminating watchdog process with PID 34973
Memory reservation released for netcpa
netCP agent service is stopped

[root@dc1-pod11-esx-a-03:~] /etc/init.d/vShield-Stateful-Firewall stop
watchdog-vShield-Stateful-Firewall: Terminating watchdog process with PID 35483
vShield-Stateful-Firewall stopped
watchdog-dfwpktlogs: Terminating watchdog process with PID 35463
Resource pool 'host/vim/vmvisor/vsfwd' released.

[root@dc1-pod11-esx-a-03:~] /etc/init.d/vShield-Stateful-Firewall start
vShield-Stateful-Firewall is not running
watchdog-dfwpktlogs: PID file /var/run/vmware/watchdog-dfwpktlogs.PID does not exist
watchdog-dfwpktlogs: Unable to terminate watchdog: No running watchdog process for dfwpktlogs
Resource pool 'host/vim/vmvisor/vsfwd' release failed. retrying..
Resource pool 'host/vim/vmvisor/vsfwd' release failed. retrying..
Resource pool 'host/vim/vmvisor/vsfwd' release failed. retrying..
Resource pool 'host/vim/vmvisor/vsfwd' release failed. retrying..
Resource pool 'host/vim/vmvisor/vsfwd' release failed. retrying..
vShield-Stateful-Firewall started

[root@dc1-pod11-esx-a-03:~] /etc/init.d/netcpad start
Memory reservation set for netcpa
Reload security domains
netCP agent service starts
[root@dc1-pod11-esx-a-03:~]
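The restart from the session above can be wrapped in a small function so it runs in the same order every time: stop netcpad, stop and start vShield-Stateful-Firewall (vsfwd), then start netcpad. This is only a sketch; it assumes the init scripts return a non-zero exit code when start or status fails, and `INIT_DIR` is a hypothetical override so the function can be tried against stub scripts (on ESXi it stays `/etc/init.d`). As described in this post, this restart alone did not fix the problem here.

```shell
#!/bin/sh
# INIT_DIR is a hypothetical override for testing; ESXi uses /etc/init.d.
INIT_DIR=${INIT_DIR:-/etc/init.d}

# Restart the NSX host agents in the same order as the manual session.
restart_nsx_agents() {
    # Stop failures are ignored so an agent that is already down does not
    # abort the restart.
    "$INIT_DIR/netcpad" stop
    "$INIT_DIR/vShield-Stateful-Firewall" stop
    "$INIT_DIR/vShield-Stateful-Firewall" start || return 1
    "$INIT_DIR/netcpad" start || return 1
    # Both status calls should report the service as running afterwards.
    "$INIT_DIR/vShield-Stateful-Firewall" status &&
        "$INIT_DIR/netcpad" status
}

# On the ESXi host:
#   restart_nsx_agents
```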

Unfortunately, this still did not resolve the problem...

Next I rebooted the host, but that did not fix the problem either.

I eventually fixed it with the following steps:

Put the host in maintenance mode
Take it out of the cluster (drag and drop it onto the datacenter object)
Reboot it twice
Put it back into the cluster (drag and drop it onto the cluster object)
Take it out of maintenance mode
Force Sync / Resolve (in Host Preparation)
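Only the host-local parts of these steps can be done from the ESXi shell; moving the host out of and back into the cluster, and the Force Sync / Resolve action, are vCenter operations and are not covered here. The sketch below assumes running VMs have already been migrated off the host (esxcli does not evacuate them) and that `esxcli system maintenanceMode get` prints `Enabled` or `Disabled`. `ESXCLI` is a hypothetical override so the functions can be tested against a stub.

```shell
#!/bin/sh
# ESXCLI is a hypothetical override for testing; on ESXi it is just esxcli.
ESXCLI=${ESXCLI:-esxcli}

# Enter maintenance mode and confirm the host reports it.
enter_maintenance() {
    "$ESXCLI" system maintenanceMode set --enable true &&
        [ "$("$ESXCLI" system maintenanceMode get)" = "Enabled" ]
}

# Leave maintenance mode and confirm the host reports it.
exit_maintenance() {
    "$ESXCLI" system maintenanceMode set --enable false &&
        [ "$("$ESXCLI" system maintenanceMode get)" = "Disabled" ]
}

# Reboot with a logged reason; ESXi only accepts this in maintenance mode.
reboot_host() {
    "$ESXCLI" system shutdown reboot --reason "$1"
}
```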

The steps listed above caused the VIBs to be reinstalled on the faulty hosts, and that eventually resolved the issue. I had tried to fix this without a host reboot, but that was not possible...
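To confirm the reinstall from the host side, the output of `esxcli software vib list` can be checked for the NSX VIBs. The helper `check_nsx_vibs` below is hypothetical, and the VIB names `esx-vxlan` and `esx-vsip` are assumed for NSX 6.2; other NSX releases may use different names, so adjust the list for your version.

```shell
#!/bin/sh
# Hypothetical helper: checks that the assumed NSX 6.2 VIBs are installed.
# Reads "esxcli software vib list" output on stdin.
# Returns 0 only if every listed VIB is present.
check_nsx_vibs() {
    out=$(cat)
    rc=0
    for vib in esx-vxlan esx-vsip; do
        # The first column of each row is the VIB name.
        if printf '%s\n' "$out" | awk -v v="$vib" '$1 == v { found = 1 } END { exit !found }'; then
            echo "$vib: installed"
        else
            echo "$vib: missing"
            rc=1
        fi
    done
    return $rc
}

# On the ESXi host:
#   esxcli software vib list | check_nsx_vibs
```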