
    Good Practices for improving performance

    Joshua Milligan

      I'm assuming that doing some or all of the following would help improve the performance of a flow, but would anyone be able to confirm and even provide some technical knowledge on why/why not?

       

      (in no particular order)

       

      1. Removing unused fields early in the flow, especially prior to joins or unions (and particularly when a join duplicates rows)
      2. Performing calculations on a few rows of data prior to a join that "blows out" the number of rows, as opposed to performing the same calculations after the join
      3. Filtering to remove a large number of records as close to the beginning of the flow as possible rather than towards the end
        a. Does it matter whether the filtering is done in the source node or in the first cleaning step after the source?
      4. Minimizing the use of string calculations?

       

      Are there any other good practices we should employ?

       

      Best Regards,

      Joshua

        • 1. Re: Good Practices for improving performance
          Isaac Kunen

          Hi Joshua,

           

          The answers to some of these are a little subtle. I'm going to try to be brief; please follow up if brevity ends up trumping clarity.

           

          First, I'm going to distinguish between debug and batch execution modes:

          • When you're building a flow, you're in debug mode. In debug mode, you can imagine that we cache each input in Hyper, and on each subsequent operation, we apply the operation and cache the result. This isn't actually true -- we try to avoid materializing all of these internal caches -- but it's an okay mental model.
          • When you execute a flow, you're in batch mode. In batch mode, we try to collect up the largest pieces of the flow possible and execute them as a single query through our federation engine. Today, that means one query per output. Because we're executing one big query, we can apply optimizations before we federate the query, and each of the target databases can apply more optimizations on their respective subqueries.
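          (Not Maestro code, just a rough analogy.) If it helps, here is a tiny Python sketch of that mental model, using the standard sqlite3 module as a stand-in engine and an invented orders table: the "debug" half materializes each step's result before the next step runs, while the "batch" half runs the same logic as one query the engine can optimize as a whole.

          ```python
          import sqlite3

          # Toy stand-in input; in reality this could be a table in a remote database.
          con = sqlite3.connect(":memory:")
          con.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL, note TEXT)")
          con.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
                          [(1, "West", 10.0, "a"), (2, "East", 20.0, "b"), (3, "West", 30.0, "c")])

          # Debug-mode mental model: each step runs and its result is cached before the next.
          step1 = con.execute("SELECT id, region, amount FROM orders").fetchall()  # drop unused column
          step2 = [row for row in step1 if row[1] == "West"]                       # filter
          step3 = [(i, amt * 2) for i, _region, amt in step2]                      # calculation

          # Batch-mode mental model: the same steps are collected into one query, and the
          # engine is free to push the filter and column pruning down however it likes.
          batch = con.execute(
              "SELECT id, amount * 2 FROM orders WHERE region = 'West'"
          ).fetchall()

          print(step3)  # [(1, 20.0), (3, 60.0)]
          print(batch)  # [(1, 20.0), (3, 60.0)] -- same result, but produced by one query
          ```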

           

          Now, your optimizations:

          1. This should help during debug mode, since you're reducing the size of the intermediate caches before the union operator. If you do remove the columns in the input node, we'll up the number of rows we pull as our sample, which may give you a more representative set to work with. In batch mode, however, the query processor will generally push column removals down as far as it can, so I would not expect there to be much of a benefit. Put another way, the query processor should automatically give you the benefit of this optimization during batch mode.
          2. An example here might help. I'm thinking of a pivot that makes the table longer and narrower. I suspect this is going to depend a lot on how much of a "blow out" you get, as well as the join clause and other factors. But my gut instinct is the opposite of yours: that you'll be best off keeping the tables as short as possible prior to the join, and blowing them out afterwards.
          3. The answer to this is very similar to the answer to (1). During debug mode, filtering early will probably help with performance. Filtering in the input node will have the added benefit of giving you an on-target sample. But during batch mode, the system will try to push filters down as much as possible, so you'll often get the perf gain of early filtering regardless of where you put the filters (there's a small sketch of (1) and (3) right after this list).
          4. Yeah. The devil is in the details, of course, but string calcs are never the fastest.
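          To illustrate (1) and (3) under the debug-mode mental model, here's a small hand-rolled Python sketch with invented customer/order data; nothing here is Maestro internals. The point is just how much smaller the materialized intermediate is when the unused wide column and the unwanted rows are gone before the join. In batch mode, as noted above, the query processor should generally do this pushdown for you.

          ```python
          # Invented data: customers carry a wide "notes" column that nothing downstream uses.
          customers = [(cid, f"cust{cid}", "West" if cid % 2 else "East", "long free text...")
                       for cid in range(5_000)]
          orders = [(oid, oid % 5_000, oid * 1.5) for oid in range(50_000)]  # (order_id, customer_id, amount)

          # Late pruning: join first, so every order row drags along every customer column.
          cust_by_id = {c[0]: c for c in customers}
          joined_late = [o + cust_by_id[o[1]] for o in orders]

          # Early pruning + filtering: keep only the join key and the one column we need,
          # and drop non-West customers before the join ever sees them.
          west_region = {cid: region for cid, _name, region, _notes in customers if region == "West"}
          joined_early = [o + (west_region[o[1]],) for o in orders if o[1] in west_region]

          print("late :", len(joined_late), "rows x", len(joined_late[0]), "columns")    # 50000 x 7
          print("early:", len(joined_early), "rows x", len(joined_early[0]), "columns")  # 25000 x 4
          ```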

           

          There are going to be a lot of other good practices. I'll leave you with one big thought, though. In Project Maestro, unlike Tableau, we give you the same operations regardless of your underlying data source(s). This is more powerful, but it gives you more opportunity to shoot yourself in the foot perf-wise.

           

          For example, if I have a large sales table in SQL Server, and I want to get average sales by department, I can do this in either Project Maestro or Tableau. In Project Maestro (batch mode) and in Tableau this will probably perform very well, since we'll push a GROUP BY query off to the SQL Server engine, and all that will come back is a small number of aggregated rows.

           

          If, on the other hand, I want the median... Well, Tableau won't let you do this without taking an extract because SQL Server doesn't support median. Project Maestro will let you take the median, but it will accomplish this task by pulling all the rows into Hyper, and then using Hyper to perform the median. The system will try to pull back only the columns it needs, so it won't generally pull the entire table, but this could still be a lot of data, and may take a while to complete.
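          To make that contrast concrete, here's a minimal sketch in Python, with sqlite3 standing in for SQL Server, plain Python standing in for Hyper, and an invented sales table. The average comes back as a handful of already-aggregated rows; the median requires pulling every row's value and doing the work locally.

          ```python
          import sqlite3
          from statistics import median

          # sqlite3 standing in for the remote SQL Server; invented table and data.
          con = sqlite3.connect(":memory:")
          con.execute("CREATE TABLE sales (department TEXT, amount REAL)")
          con.executemany("INSERT INTO sales VALUES (?, ?)",
                          [("Toys", 10.0), ("Toys", 30.0), ("Food", 5.0), ("Food", 7.0), ("Food", 9.0)])

          # Average by department: the GROUP BY can be pushed to the source engine, so only
          # a few pre-aggregated rows ever come back over the wire.
          avg_rows = con.execute(
              "SELECT department, AVG(amount) FROM sales GROUP BY department"
          ).fetchall()
          print(avg_rows)  # e.g. [('Food', 7.0), ('Toys', 20.0)]

          # Median by department: if the source engine can't compute it, the fallback is to
          # pull the needed column(s) for every row and compute locally (Hyper's role here).
          pulled = {}
          for department, amount in con.execute("SELECT department, amount FROM sales"):
              pulled.setdefault(department, []).append(amount)
          print({dept: median(vals) for dept, vals in pulled.items()})  # e.g. {'Toys': 20.0, 'Food': 7.0}
          ```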

           

          Cheers,

          -Isaac

          • 2. Re: Good Practices for improving performance
            Joshua Milligan

            Isaac,

             

            That is incredibly helpful!  Especially the distinction between debug and batch mode.

             

             

            For Number 2, the example might be something along the lines of having 1 table of customers that includes their birth date and another table of orders.  When I join, I'll likely have multiple records for a single customer.  Would it be more efficient to calculate the age of the customer prior to joining the tables so that the calculation is done once per customer as opposed to once per order?  Or, in batch mode, does it even matter?

             

            Best Regards,

            Joshua

            • 3. Re: Good Practices for improving performance
              Isaac Kunen

              Hi Joshua,

               

              Glad that stuck.

               

              Gotcha. In this case, it's not that you're going to expand the number of rows before the join: you're expanding the number of columns with the new calculation. My gut feeling is that (a) it's probably better to do this before the join, but (b) that the difference will be marginal unless the calculation is very expensive. This is also likely to be reordered by query processing during batch execution.
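              Here's a tiny sketch of the two orderings with invented data (the "today" date is pinned just to keep it reproducible). The result is identical either way; the only difference is whether the age calculation runs once per customer or once per order row.

              ```python
              from datetime import date

              # Invented sample data.
              customers = {1: date(1980, 5, 1), 2: date(1990, 7, 15)}   # customer_id -> birth_date
              orders = [(101, 1), (102, 1), (103, 2), (104, 1)]         # (order_id, customer_id)

              def age(birth_date, today=date(2018, 1, 30)):
                  return today.year - birth_date.year - (
                      (today.month, today.day) < (birth_date.month, birth_date.day))

              # Before the join: the age calculation runs once per customer (2 times here)...
              age_by_customer = {cid: age(bd) for cid, bd in customers.items()}
              joined_a = [(oid, cid, age_by_customer[cid]) for oid, cid in orders]

              # ...after the join: the same calculation runs once per order row (4 times here).
              joined_b = [(oid, cid, age(customers[cid])) for oid, cid in orders]

              assert joined_a == joined_b  # same answer either way; only the work differs
              ```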

               

              Cheers,

              -Isaac

              • 4. Re: Good Practices for improving performance
                Toby Erkson

                Isaac Kunen wrote:

                 

                ...

                 

                If, on the other hand, I want the median... Well, Tableau won't let you do this without taking an extract because SQL Server doesn't support median. Project Maestro will let you take the median, but it will accomplish this task by pulling all the rows into Hyper, and then using Hyper to perform the median. The system will try to pull back only the columns it needs, so it won't generally pull the entire table, but this could still be a lot of data, and may take a while to complete.

                My hope here is that ProM will be "smart" enough to recognize this and provide the user with adequate warning(s) (all of which can be optionally suppressed).