    Server 2018.2.3 status alerts report UP with no DOWN status under moderate load

    Richard McDaniel

      I manage a 4-node Tableau Server 2018.2.3 cluster with 16 2GHz CPUs on each of the 3 worker nodes and 12 2GHz CPUs on the initial node, with 128GB of RAM on each node.  We recently had a day with nearly 500 active users on this cluster.  The servers occasionally spiked to 100% CPU usage in real time, but never got above 60% on the 5-minute average and never used more than 40% of the RAM.  We have status alerts enabled to email us recent process status changes.  If we've done our jobs right, I don't expect to see these alerts unless I am actively restarting the server for scheduled maintenance.  If something goes wrong, I expect to see at least one DOWN event logged in the status email before an UP event, which arrives in the same email or in a following one after the problem is fixed.  That pattern is normal to me, but recently something strange has happened.


      During this moderately high load window, I started receiving status alerts in which the Gateway process on some or all of the nodes reported at least one UP event, and sometimes two to four.  What's strange is that there were no DOWN events and no perceived user impact; it just reported UP over and over again.  I'm assuming that some momentary latency triggered the event, but the status changed back to UP before the DOWN was recorded.  Regardless, if there really is an issue, these alerts don't help me find it, and if there is no issue, as they claim, then why alert in the first place?  If Tableau is really that sensitive to latency when monitoring the Gateway process, it seems like there should be a way to increase the number of failures before an alert, or to increase the acceptable latency before a failure is declared.  I'd like to keep these alerts enabled for at least the other processes, but if the Gateway process keeps alerting that it's UP, it only trains me to ignore these alert emails, which doesn't do us any good.
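
      To make the pattern concrete, here is a rough sketch of the check being described: flag any UP event that was not preceded by a DOWN for the same node and process.  It assumes the alert history has been transcribed by hand into (timestamp, node, process, status) tuples.  The tuple layout, the function name, and the sample values are all hypothetical and are not anything Tableau Server provides or logs in this form.

```python
def find_unpaired_up_events(events):
    """Return UP events with no preceding DOWN for the same (node, process).

    `events` is a hypothetical list of (timestamp, node, process, status)
    tuples, with ISO-8601 timestamp strings so that sorting is chronological.
    """
    last_status = {}   # most recent status seen per (node, process)
    unpaired = []
    for timestamp, node, process, status in sorted(events):
        key = (node, process)
        # An UP is "unpaired" if the previous status for this key was not DOWN.
        # Note: the very first UP seen for a key is also flagged, since there
        # is no recorded DOWN before it.
        if status == "UP" and last_status.get(key) != "DOWN":
            unpaired.append((timestamp, node, process))
        last_status[key] = status
    return unpaired


# Made-up sample data, for illustration only.
events = [
    ("2018-10-01T09:00:05", "node2", "gateway", "UP"),
    ("2018-10-01T09:00:40", "node2", "gateway", "UP"),
    ("2018-10-01T09:05:00", "node3", "vizqlserver", "DOWN"),
    ("2018-10-01T09:06:10", "node3", "vizqlserver", "UP"),
]

for event in find_unpaired_up_events(events):
    print(event)
```

      With this sample, the two Gateway UP events are flagged, while the vizqlserver UP is not, because it follows a recorded DOWN.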


      Is there any way to adjust the alert thresholds for these events?  Is there truly something wrong that I should investigate when it tells me the Gateway process came UP several times without ever going DOWN?