Yes -- precisely. We sample to try to keep the data manageable while you're massaging it, but there are operations like a join that can increase the number of records, so we will apply our sampling logic on top of them as well.
You should absolutely see the "Sampled" badge when this happens -- and this should draw your attention to the fact that the join results may be misleading. If it's not showing up for you here (and if those columns really are all "1" as it appears) then it's definitely a bug.
Thank you! Yes, I was pretty sure the "Sampled" badge showed up at first but then disappeared. Your explanation makes perfect sense and yes, all the values are 1 for both fields.
Is there any chance to turn sampling off? I would rather accept a performance drop than be unable to see the real join results, since the sampled results can be misleading.
E.g.: one source showed a row count of 1,048,576, which is already a sampled count. The other source has 4k rows. My join results showed a total of 1,048,576, which is the same amount (again sampled), so I get no indication of how my join worked. That makes the summary of join results at least partly pointless.
Only after running an output did I get an idea of how many rows were returned. This makes it difficult to validate data, as running the output takes time anyway. Sampling makes sense if you don't need to validate, but bigger data sets do need validating after a flow is created. I would assume that validating steps isn't a unique use case, is it?
It is turned off for all my sources and it still samples at around 1M.
By turned off I mean I selected "use all data"
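To make the problem concrete, here is a minimal sketch in plain Python. The cap value is simply the row count quoted above; the function name `displayed_row_count` and the idea that the cap behaves like a simple truncation are assumptions for illustration, not a description of how the product actually samples.

```python
ROW_CAP = 1_048_576  # row count quoted in this thread; assumed to act as a hard cap


def displayed_row_count(true_rows: int, cap: int = ROW_CAP) -> int:
    """Row count a user would see if results were truncated at `cap` (assumption)."""
    return min(true_rows, cap)


# Three very different hypothetical join outcomes (made-up numbers).
scenarios = {
    "every row matched once": 5_000_000,
    "half the rows matched": 2_500_000,
    "rows duplicated by the join": 20_000_000,
}

for label, true_rows in scenarios.items():
    shown = displayed_row_count(true_rows)
    print(f"{label}: true={true_rows:,} shown={shown:,}")
```

Under these assumptions all three scenarios display the same count, which is why a capped total says nothing about whether the join behaved as intended.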
Ah! I think I read your post too quickly.
The sampling in the input step should affect how many rows we read from the source system. (If not, it's a bug.) BUT, we may still introduce sampling later in the flow, particularly after a join, to protect against data exploding.
It's not a bad feature ask to be able to turn this off. You may really shoot yourself in the foot, though. My instinct is to make this a one-time operation. E.g., in the join, we could use sampling as usual, but then let you re-run on all data once you think the join is correct.
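A rough sketch of that idea in plain Python, not of the product's implementation: the helper names `inner_join` and `preview`, the cap of 10,000, and the toy data are all hypothetical. It shows how a join on a repeated key can multiply rows, and how a capped preview can be followed by a single full pass once the join logic looks right.

```python
from collections import defaultdict
from itertools import islice


def inner_join(left, right, key):
    """Lazily yield one merged row for every left/right pair sharing `key`."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    for l_row in left:
        for r_row in index.get(l_row[key], []):
            yield {**l_row, **r_row}


def preview(rows, cap):
    """Materialize at most `cap` rows, like a sampled view of the join."""
    return list(islice(rows, cap))


# Toy data: every row on both sides shares the same key value,
# so each left row matches every right row.
left = [{"id": 1, "l": i} for i in range(1000)]
right = [{"id": 1, "r": j} for j in range(1000)]

# Step 1: check the join logic on a capped preview.
sample = preview(inner_join(left, right, "id"), cap=10_000)
print(len(sample))

# Step 2: once the join looks correct, run one pass over all data
# to get the true row count.
full_count = sum(1 for _ in inner_join(left, right, "id"))
print(full_count)
```

Because the join is a generator, the preview stops after the cap and never builds the full result; only the explicit second pass pays the cost of walking every joined row.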