0 Replies Latest reply on Dec 22, 2015 11:21 AM by Matt Coles

    Success stories using VizAlerts at Tableau

    Matt Coles

      Happy Holidays everyone!

       

      It's been a while since I've talked about the alerts we've built and used here at Tableau, so I thought I'd take a minute to share some stories, not just about the alerts themselves, but about what has worked well for us and the unintended benefits we've seen from running VizAlerts for so many months.

       

      At this point, we've got 45 distinct alerts running on our Tableau Server instance, serving 38 separate individuals, with 72 subscriptions in total. A few are advanced alerts that push alert emails to arbitrary Tableau employees, so we're actually serving a larger population, but these are the trackable metrics.

       

      But who cares about those numbers? What truly matters is the benefits VizAlerts actually brings to us as individuals, as teams, and as a company. So what are they? Here are a few of the biggest we've seen:

       

       

      1. We can detect and correct issues with our integration software within minutes, rather than hours

      Quarter endings are always busy at Tableau, and the number of orders being processed can spike sharply. In the past, this has resulted in our homebuilt automated order processing software becoming overwhelmed and halting processing. Not good at a time when throughput is most critical! It just so happens, however, that this system pushes status data into Salesforce, which we then pull down into a SQL Server database every 15 minutes. Dashboards had already been created to visualize this data, but they had to be checked manually to find an issue. A PM from our operations team built an alert on this data that runs every 15 minutes and emails him (and the 8 other people who've subscribed to it) when one or more of our integration tasks haven't run within the desired time threshold. Each threshold is customized for the particular task our integration software performs, with the values stored as Parameters for easy altering. Right after it was first deployed, this alert caught issues with the software before the developers noticed any problems, which was rare for us! And now, should the software begin to fail, all the people who need to know about it will know nearly immediately that there is an issue, and which processing is not happening.
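To make the per-task threshold idea concrete, here is a minimal Python sketch of the check described above. It is only an illustration: the real alert is a Tableau viz with the thresholds held in Parameters, and the task names, threshold values, and the `overdue_tasks` helper are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical per-task thresholds in minutes. In the real alert these
# values are stored as Tableau Parameters, not in code.
THRESHOLD_MINUTES = {
    "order_sync": 30,
    "invoice_push": 60,
}

def overdue_tasks(last_run_by_task, now, thresholds=THRESHOLD_MINUTES):
    """Return (task, minutes_since_last_run) for every task past its threshold."""
    overdue = []
    for task, last_run in sorted(last_run_by_task.items()):
        limit = thresholds.get(task)
        if limit is None:
            continue  # no threshold defined, so nothing to compare against
        elapsed = (now - last_run) / timedelta(minutes=1)
        if elapsed > limit:
            overdue.append((task, round(elapsed)))
    return overdue

# Made-up sample data: one task is 45 minutes stale, the other 20.
now = datetime(2015, 12, 22, 11, 0)
last_runs = {
    "order_sync": now - timedelta(minutes=45),
    "invoice_push": now - timedelta(minutes=20),
}
print(overdue_tasks(last_runs, now))
```

An empty result means nothing is overdue, which corresponds to the viz returning no rows and therefore no email being sent.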

       

      Interestingly, running this alert every 15 minutes had a separate and beneficial side effect of revealing an ETL issue that we hadn't known about before. As it turned out, our Salesforce -> SQL Server ETL was being massively delayed during the morning hours, between around 6am and 8am. It happened every day, and the alert would fire several false positives during that timespan. We had our DBAs investigate, and they found that the system was being overwhelmed by a massive number of records that were being unnecessarily dumped in. They tweaked that process to improve efficiency, and we no longer have the issue. This issue had been affecting the whole company, but no one had ever known about it or reported it with enough detail to find the root cause. Because VizAlerts was not only watching the data so frequently, but also exporting CSVs containing the specific data it saw at that time, we were able to establish a pattern, build evidence that there was a problem, and get it fixed.

       

      an example of a "false positive" alert showing all systems as down--this was actually indicative of an ETL issue

       

      opa.png

       

       

      2. We can provide just-in-time instructions for users having problems

      Since everyone at Tableau has access to Tableau Desktop and Tableau Server, our back-end database systems receive a lot of traffic. So much so that in order to simply keep our system up, our DBAs have had to set up a job that automatically kills queries that block others, run too long, or consume too many resources. That helps solve a lot of system stability issues, but it doesn't result in a good end-user experience for someone trying to do some data analysis whose query is terminated with extreme prejudice! Luckily, the auto-kill job creates records in a table which track various aspects of the query it killed, which gives us enough information to send users an informative email telling them (1) why their query was killed and (2) how to avoid this in the future. Previously, emails were only sent to the individuals who had persistent issues, and since it was a manual process, it consumed a lot of time. Everyone else just had their query killed with no notification whatsoever.

       

      Now, we've set up an Advanced Alert that runs every 15 minutes and keys off the table the query data is stored in. It determines what the issue with the query was, then uses some dynamic HTML formatting to tell the user to do specific things to resolve it. It also tells the user what application they were using to issue the query, from what machine, how long it ran, what time it was killed, and the text of the actual query. And because this isn't really an email anyone actually wants to get, I injected a little bit of fun into it by making the Subject line cycle pseudo-randomly through ten different "exclamations," chosen by the modulo (%, aka remainder after division) of the Id of the table record. So some emails will say "Bah!", others will say "Crikey!", another might say "Aarrg!"...you get the idea.
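The modulo trick for the Subject line can be sketched in a few lines of Python. This is only an illustration of the idea, not the calculation used in the actual viz: the list below holds just the three exclamations named above (the real alert cycles through ten), and the base subject text and the `subject_for` helper are hypothetical.

```python
# Only the three exclamations mentioned in the post; the real alert has ten.
EXCLAMATIONS = ["Bah!", "Crikey!", "Aarrg!"]

def subject_for(record_id, base="Your query was killed"):
    """Pick an exclamation from the record Id's remainder after division."""
    # record_id % len(...) always falls in the range 0..len-1, so it is a
    # valid index, and consecutive Ids rotate through the list in order.
    exclamation = EXCLAMATIONS[record_id % len(EXCLAMATIONS)]
    return f"{exclamation} {base}"

for record_id in (0, 4, 5):
    print(record_id, subject_for(record_id))
```

Because the choice depends only on the record Id, the same record always gets the same subject, which keeps the behavior repeatable even though it looks random to the recipient.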

       

      Bottom line is, with this alert, we improve users' experience by helping them correct issues, and in so doing we also reduce strain on our back-end systems. All with zero ongoing effort from the admins!

       

       

      this user randomly received a seasonally appropriate subject in their alert

      sqlkill.png

       

      3. We caught several bugs in our products before they were shipped to you

      Using our own software is a core cultural value at Tableau, as you all already know. The Tableau Server I maintain is used by all of Sales, Marketing and Operations within Tableau Software, and having such a large and active user community providing feedback on problems they have or improvements they'd like to see made is critical to us building and selling a great set of products. But users can't find every issue, and even if they do, that particular person might forget to mention it, or might not know what to do about it. That's where automation and data-driven alerting can play a big role in helping us watch for problems on our server, and thus in our product. Automated tests provide us data to review, and VizAlerts notifies us when there's been a problem of sufficient severity that we should take note. The nice part about that is, not only does VizAlerts proactively notify us, but it also helps by providing some objective criteria for when there's a problem, triggering us to look at it. It's an interesting psychological phenomenon: I'll peruse a monitoring viz one day and notice there's a slightly larger mark in one area, or some additional color somewhere--but I may choose to ignore it depending on how busy I am, or (if I'm honest) how lazy I'm being. But if you've had to define your thresholds up front, and they're exceeded, you get a black and white criterion for whether you need to do something: the email either comes in or it doesn't. That has caused us to look into some issues we might otherwise have overlooked.

       

      So the specific benefit in this case has been catching three or four bugs in Tableau Server that we'd otherwise not have caught--or at least, catching them much sooner than we normally would have. We run an internal tool called VizCrawler (derived from TabJolt) against several of our Tableau Servers nightly. It loads the top N views and logs whether they succeeded or failed, and how long they took to render. Using this, we've caught problems with backup/restore, rendering views under load, and problems with back-end systems unrelated to our products.
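As a rough illustration of how nightly results like these can be turned into a yes/no alert condition, here is a small Python sketch. VizCrawler is an internal tool, so the record layout, the view names, the 10-second limit, and the `views_to_flag` helper are all assumptions made for the example.

```python
def views_to_flag(results, max_seconds=10.0):
    """Return (view, reason) for each view that failed or rendered too slowly."""
    flagged = []
    for result in results:
        if not result["succeeded"]:
            flagged.append((result["view"], "failed"))
        elif result["seconds"] > max_seconds:
            flagged.append((result["view"], "slow"))
    return flagged

# Made-up nightly results in a hypothetical log format.
nightly = [
    {"view": "Sales/Pipeline", "succeeded": True, "seconds": 3.2},
    {"view": "Ops/Orders", "succeeded": False, "seconds": 0.0},
    {"view": "Marketing/Leads", "succeeded": True, "seconds": 27.5},
]
print(views_to_flag(nightly))
```

Defining the limit up front is what gives the black and white rule discussed earlier: either a view is in the flagged list, or it isn't.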

       

      What's more, by simple virtue of running VizAlerts at all (forget the specific alerts themselves), we've found and fixed several other issues with our product. Firing off trusted ticket requests every minute, exporting CSVs, and rendering PNGs from Tableau Server on an ongoing basis identified two other problems that we've since fixed--problems we'd never even known about! That's the kind of thing that makes me extremely happy.

       

      vizcrawl.png

       

       

       

      So that's three examples of how we at Tableau have benefited from running VizAlerts. If you've got some particular success stories you want to share, I'd love to hear them. If not, I hope you will soon!

       

       

      thanks guys.

       

      Matt