I have the same question and face the same issue. I have had cases where, with sampling on, the joins in my flow return no records while I'm building it. And Prep gets really bogged down if I use all data.
It would be great if we could build some custom sampling logic. For example, sample only a given year's data to reduce the overall size for debugging, but pull all the data when the flow is actually run. Instead, I'm doing that as a filter on the data input step, but that affects the output when I finally run the flow, and I have to remove those filters by hand.
So I think you understand batch-mode execution correctly. In your case we will try to push the query down to Teradata; it's always possible that something you did before the join precludes that, but it doesn't sound like you're hitting an issue there.
During debug mode, though, we should be pulling a sample of both A and B into a cache in Hyper. At that point, the downstream join should be done against that cache, so during debug I would not expect to hit Teradata with the join. I also wouldn't expect the join to take 20 minutes, though... This could well be worth filing a bug through support.
You also ask what happens if there are no matching results in the sample. The short answer is that if there are no matching results, then there are no matching results. You can improve your odds of getting matches by flipping the tables over to random sampling (which may be very expensive) or by putting in a temporary filter in the input step to cull down the sample to a range where you know you'll get a match in the join. This is a pretty fundamental issue with sampling, although we do have some ideas on how to improve things in the future.
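To make that sampling hazard concrete, here is a small sketch (plain Python with invented data, not Prep's actual sampler) showing how independently sampling two tables can leave a join with no matches even though every row in the full tables matches, and how a temporary filter on both inputs restores the overlap:

```python
import random

# Hypothetical tables: every id in table_a has exactly one match in table_b.
table_a = [{"id": i} for i in range(1000)]
table_b = [{"id": i} for i in range(1000)]

random.seed(7)
# Each input is sampled independently, as Prep's per-input sampling would do.
sample_a = random.sample(table_a, 30)
sample_b = random.sample(table_b, 30)

# Inner join on id: two 30-row samples from 1000 rows rarely overlap.
matches = {r["id"] for r in sample_a} & {r["id"] for r in sample_b}
print("matches without a filter:", len(matches))  # often 0 or 1

# Workaround: a temporary filter culls both inputs to a known-overlapping
# range BEFORE sampling, so the samples are forced to share ids (pigeonhole:
# 30 + 30 draws from only 50 candidates must collide).
narrow_a = [r for r in table_a if r["id"] < 50]
narrow_b = [r for r in table_b if r["id"] < 50]
overlap2 = ({r["id"] for r in random.sample(narrow_a, 30)}
            & {r["id"] for r in random.sample(narrow_b, 30)})
print("matches with a filter:", len(overlap2))  # guaranteed at least 10
```

The exact counts depend on the random seed, but the point holds generally: the narrower the pre-sample filter, the better the odds that both samples land on the same rows.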
Hope this helps,
Thank you, that explains it.
As for the 20+ minute step execution: after talking to the DBAs, it appears that a lot of these database objects are cascading views (a view on a view on a view, etc.), so even a quick sample triggers a full scan of the table at the bottom of the stack. I can try to avoid that by filtering the initial input on fields the underlying table is indexed or partitioned on, but without knowing how Prep handles the flow, it's a stab-in-the-dark, trial-and-error exercise.
Tableau Desktop has some resources published on optimising performance - maybe we can do something similar for Prep?
Got it. Yeah, the one interesting thing to be aware of is that filters (and column selections) in the input step should happen before sampling. If you do them later in the flow, you're working on the already-sampled data. Hopefully that will help you with your tuning.
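A rough illustration of why that ordering matters (plain Python with invented numbers, not Prep internals): a filter in the input step runs before sampling, so the whole sample is relevant; the same filter placed downstream runs on already-sampled data and just shrinks it.

```python
import random

random.seed(1)
# Hypothetical rows spanning ten years; suppose we only care about 2018.
rows = [{"year": 2009 + i % 10, "value": i} for i in range(10_000)]
SAMPLE_SIZE = 500

# Input-step filter: applied BEFORE sampling, so all 500 rows are from 2018.
filtered_first = random.sample(
    [r for r in rows if r["year"] == 2018], SAMPLE_SIZE)

# Downstream filter: sampling happens first, so only about a tenth of the
# sampled rows survive the filter.
sampled_first = [r for r in random.sample(rows, SAMPLE_SIZE)
                 if r["year"] == 2018]

print(len(filtered_first), len(sampled_first))  # 500 vs. roughly 50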
And yes -- more resources here would be good. Thanks for the feedback.
Isaac, I have been thinking about this and have another question.
The problem I'm having is that if the data is sampled and the join shows exclusions, I have no way of knowing whether those are due to sampling or to genuine data issues.
So I tried different approaches and ended up switching to "all data" with custom SQL that ensures both the A and B inputs are small but matching subsets of the full data, i.e. I expect the join to have no exclusions caused by sampling.
The question is - If I don't use custom SQL but instead connect directly to a large table, switch on "all data" and specify a filter in Prep - will that cause Prep to load all data first and then apply the filter, or will Prep add the filtering clause to the initial query in the background?
With sampling ON it is logical to assume that sampled data needs to load first, but with all data it should be OK to apply the filter first.
Right -- your join problem is a great example of the hazard in sampling.
But to answer your question: if the filter predicate can be pushed down, it should be. My caveat here is that we will allow you to write a predicate that the target database may not support, and in that case we'll pull data back first. But most likely this will be pushed down and we'll retrieve the records after the filter.
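As a rough analogy for the two cases (sqlite3 standing in for Teradata; table and column names invented), a pushed-down predicate travels inside the SQL so only matching rows come back, while an unsupported predicate forces everything to be pulled back and filtered client-side:

```python
import sqlite3

# In-memory database standing in for the remote Teradata source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, year INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, 2015 + i % 5) for i in range(1000)])

# Pushed down: the database evaluates the filter; only matching rows return.
pushed = conn.execute(
    "SELECT id FROM orders WHERE year = 2018 ORDER BY id").fetchall()

# Not pushed down (e.g. a predicate the source doesn't support): every row
# is pulled back first, then filtered locally -- same result, far more data
# transferred over the wire.
all_rows = conn.execute("SELECT id, year FROM orders ORDER BY id").fetchall()
client_side = [(i,) for (i, y) in all_rows if y == 2018]

assert pushed == client_side  # identical answers, very different work split
print(len(pushed), "of", len(all_rows), "rows actually needed")
```

Either path yields the same records; the difference is whether the database or the client does the filtering, which is exactly what makes pushdown matter for wide tables and slow links.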
If that doesn't happen when you expect it, do let us know!