We have a three-node Tableau Server cluster with a primary and two workers. The primary is just a gateway, and the two worker nodes doing the actual work each run the same number of processes of each type. Over the last couple of weeks we have observed the following odd behavior: when we come to work in the morning, both VizQL processes on one of the workers are in an unlicensed state, and logging into Tableau Server via a browser is hampered. Login usually succeeds only after a couple of attempts, when the request finally gets routed to the other worker. After 10-15 minutes the processes self-heal and go green again, and everything is back to normal.
Looking at the server status page immediately after the problem appears, I can see that both VizQL processes on one of the workers are unlicensed; eventually one of them goes green, and finally the other also recovers.
One day it may be worker 1 and the next day the other one; this now happens practically every day.
Searching through the documentation, I found a suggestion that there might be a problem in communication with the primary node (https://onlinehelp.tableau.com/current/server/en-us/trouble_svc_unlicensed.htm), due to network issues or too many requests to that particular worker. The documentation says that stopping and starting the entire cluster is a potential solution. Such a restart, however, is already part of our regular nightly backup procedure, so it does not help.
The timing of the event also makes it implausible that rendering requests are the cause: it is simply too early, before the start of the working day. It does, however, follow a period of intensive backgrounder work refreshing multiple extracts. The 'Background Tasks for Extracts' admin view shows that these are queued at around 3:30 a.m. GMT, and some of them take an hour or more to complete, so in essence they precede the unlicensed VizQL state.
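To make the timing argument easier to verify, here is a minimal Python sketch that lists which background jobs were still running shortly before the VizQL processes went unlicensed. The function name `jobs_preceding`, the 30-minute window, and the idea of copying job start and end times out of the admin view by hand are all assumptions for illustration; this is not a Tableau API.

```python
from datetime import datetime, timedelta


def jobs_preceding(jobs, incident_start, window):
    """Return the jobs that were running at any point during the
    `window` immediately before `incident_start`.

    jobs           -- list of (start, end) datetime pairs, e.g. copied by
                      hand from the 'Background Tasks for Extracts' view
    incident_start -- datetime when the VizQL processes went unlicensed
    window         -- timedelta to look back from incident_start
    """
    window_start = incident_start - window
    # A job overlaps the look-back window if it started before the
    # incident and ended after the window began.
    return [(s, e) for s, e in jobs if s < incident_start and e > window_start]


def describe(jobs, incident_start, window=timedelta(minutes=30)):
    """Print each overlapping job and how long before the incident it ended."""
    for start, end in jobs_preceding(jobs, incident_start, window):
        gap = incident_start - end
        print(f"job {start:%H:%M}-{end:%H:%M} ended {gap} before the incident")
```

Calling `describe` with the refresh times from a few mornings, together with the time the status page first showed the unlicensed state, would show whether the long-running extracts consistently finish just before the problem starts or whether the overlap is coincidental.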
Another suggestion from the documentation was loss of communication between the primary and the workers. To rule this out we checked for firewall issues and confirmed that no firewall is blocking communication between the nodes in the cluster.
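As an illustration only, a basic reachability test between nodes could look like the Python sketch below. The helper names `can_connect` and `check_nodes` are hypothetical, and the hosts and ports to test are left to the caller: the ports your Tableau Server processes actually use depend on your configuration, so none are assumed here.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and opens a TCP socket;
        # the context manager closes it again straight away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def check_nodes(hosts, ports, timeout=3.0):
    """Try every host/port combination and return {(host, port): bool}."""
    return {
        (host, port): can_connect(host, port, timeout)
        for host in hosts
        for port in ports
    }
```

Run from the primary against both workers (and from each worker against the primary), `check_nodes` gives a quick yes/no per port. A successful connection only shows that the port was reachable at that moment; it would not catch an intermittent drop during the early-morning window, so it may be worth scheduling the check around the time the problem occurs.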
We tried to work with support on this but have not received a satisfactory resolution so far, so I wanted to check whether any of you have had similar issues in the past and can suggest potential avenues of resolution. BTW, our server is version 9.0.12, 64-bit, running on Windows Server 2012 machines.
Any input is greatly appreciated!