7 Replies Latest reply on Jul 6, 2018 7:14 AM by Isaac Kunen

    Summary of Join Results Sampled?

    Joshua Milligan

      I'm assuming the Join Result in the Summary reflects sampled data: I'm simulating a cross join of 600K records with 20 records, which should yield 12M rows (600K × 20), but I'm seeing only around 1M resulting records.

       

      At one point I saw a "Sampled" indicator, but it has since disappeared, even though the number of result records still looks like a sample.

       

      Can anyone explain what is going on here?

       

       

      Best Regards,

      Joshua

        • 1. Re: Summary of Join Results Sampled?
          Isaac Kunen

          Hi Joshua,

           

          Yes -- precisely. We sample to try to keep the data manageable while you're massaging it, but there are operations like a join that can increase the number of records, so we will apply our sampling logic on top of them as well.

           

          You should absolutely see the "Sampled" badge when this happens -- and this should draw your attention to the fact that the join results may be misleading. If it's not showing up for you here (and if those columns really are all "1" as it appears) then it's definitely a bug.

           

          Cheers,

          -Isaac

          • 2. Re: Summary of Join Results Sampled?
            Joshua Milligan

            Isaac,

             

            Thank you! Yes, I was pretty sure the "Sampled" badge showed up at first, but subsequently disappeared. Your explanation makes perfect sense and yes, all the values are 1 for both fields.

             

            Best Regards,

            Joshua

            • 3. Re: Summary of Join Results Sampled?
              Jacob Kratzsch

              Is there any chance to turn sampling off? I would rather accept a performance drop than be unable to see the real join results, since the sampled ones can be misleading.

               

              E.g.: one source showed a row count of 1,048,576, which is already a sampled count. The other source has 4k rows. My join results showed a total of 1,048,576, which is the same amount (again sampled), so I get no indication of how my join worked. That makes the Summary of Join Results at least partly pointless.
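
              A rough sketch of why a capped count hides the join behaviour, assuming (hypothetically) that the summary simply shows the true row count clipped at 1,048,576. `SAMPLE_CAP` and `displayed_join_rows` are illustrative names, not Tableau Prep internals:

```python
# Assumption: the summary shows min(true rows, cap). The cap value is the
# row count reported in this post; the product's real sampling logic may differ.
SAMPLE_CAP = 1_048_576  # equals 2**20


def displayed_join_rows(true_rows: int, cap: int = SAMPLE_CAP) -> int:
    """Row count the summary would show under the clipping assumption."""
    return min(true_rows, cap)


left_rows = SAMPLE_CAP  # left source, already sampled

# Two very different joins: each left row matches once vs. three times.
one_to_one = displayed_join_rows(left_rows * 1)
fan_out = displayed_join_rows(left_rows * 3)

# Both display the same number, so the count alone cannot tell them apart.
print(one_to_one, fan_out)
```

              Under that assumption, a clean one-to-one join and a join that triples the rows would look identical in the summary.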

               

              Only after running an output did I get an idea of how many rows were returned. This makes it difficult to validate data, since running the output takes time anyway. Sampling makes sense if you don't need to validate bigger data sets after creating a flow. I would assume that validating steps isn't an unusual use case, is it?

               

              Regards,
              Jacob

              • 4. Re: Summary of Join Results Sampled?
                Isaac Kunen

                Definitely. In the input step, select "Use all data".

                 

                 

                Perf may suffer!

                 

                Cheers,
                -Isaac

                • 5. Re: Summary of Join Results Sampled?
                  Jacob Kratzsch

                  It is turned off for all my sources and it still samples at around 1M.

                  • 6. Re: Summary of Join Results Sampled?
                    Jacob Kratzsch

                    By "turned off" I mean I selected "Use all data".

                    • 7. Re: Summary of Join Results Sampled?
                      Isaac Kunen

                      Ah! I think I read your post too quickly.

                       

                      The sampling in the input step should affect how many rows we read from the source system. (If not, it's a bug.) BUT, we may still introduce sampling later in the flow, particularly after a join, to protect against data exploding.
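
                      As a toy illustration of that idea (not how the product actually samples), one could cap a cross join at a fixed number of rows and flag when rows were dropped. The cap value, the name `capped_cross_join`, and the take-the-first-rows strategy are all assumptions made for this sketch:

```python
from itertools import islice, product


def capped_cross_join(left, right, cap):
    """Return at most `cap` rows of the cross join of `left` and `right`,
    plus a flag that is True when further rows existed but were dropped."""
    pairs = product(left, right)          # lazy iterator over the cross join
    rows = list(islice(pairs, cap))       # keep only the first `cap` pairs
    sampled = next(pairs, None) is not None  # anything left means we truncated
    return rows, sampled


# Scaled-down stand-in for the 600K x 20 case: 6,000 x 20 = 120,000 true rows.
rows, sampled = capped_cross_join(range(6_000), range(20), cap=10_000)
print(len(rows), sampled)  # 10000 True
```

                      The `sampled` flag plays the role of the "Sampled" badge here: the row count alone says nothing once the cap is hit, so the flag is the only signal that the result is partial.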

                       

                      It's not a bad feature ask to be able to turn this off. You may really shoot yourself in the foot, though. My instinct is to make this a one-time operation. E.g., in the join, we could use sampling as usual, but then let you re-run on all data once you think the join is correct.

                       

                      -Isaac