1 of 1 people found this helpful
In order to get all files in the directory automatically without having to add each one, Prep is going to want to do the union first. What transformation would be better prior to the union?
Thanks for the prompt response, Joshua Milligan
It would be amazing if you could define what to do before joining each file as the approach of applying a set of transformations can differ greatly from a simple file to all the files combined into a single table and, from my personal experience using self-service ETL tools, it's always a better idea to transform all files prior to doing the union.
This thread started as a question from one of the comments in one of the videos that I published recently comparing TP to other tools.
Imagine that I have multiple txt or csv files that have the following format:
Before appending every single file, I'd like to remove the first 3 rows (header and subheader), the last 3 rows (footer) and, since I know that every page of my report contains 20 records (data), I can get rid of the "Page Counter and information" based on the row position of them. I could do the same to get rid of the "FieldNames" rows that I don't need as I only need the first one.
I completely lose that "row position" approach when I perform the union first which usually overly complicates things or it can only work for a few scenarios.
In the video you can check at how Power BI (Power Query) does it. That's their 2nd iteration at it as their first one suffered from the same issues that Tableau Prep currently suffers.
I totally agree with you about Tableau Prep needing the ability to expose row position as information is many times encoded by row position. Please consider voting up this idea https://community.tableau.com/ideas/8732 and this one https://community.tableau.com/ideas/8714.
Ideally, the tool should make it so it doesn't matter whether you do the union before or after the transformation. And, in fact, in the case you outlined above, it should be easier to do it once after the union (both conceptually and technically) because you know the row numbers you want to keep per file (3 through 23). If Tableau Prep exposed those row numbers and retained them, per file, in the union, then you could just do a simple filter like:
RowNumber >= 3 AND RowNumber <= 23
on the union results and end up with the rows you needed.
Here's hoping that functionality comes soon!
I voted on those ideas as soon as I saw them before making the reply, but adding the index to the files prior to doing the union would technically count as a transformation step (basically adding a new column). If they could expose that as a new column after the UNION or, perhaps exposing that as metadata for each file, that would be phenomenal. The way other tools do it is that they do have that row order as metadata, but don't expose that "Row" number to the end-user in a direct way.
It's the raw union of the files that makes everything just way more complicated than what it should be. One last question, there's no way of creating an index column in Data Prep right now, correct?
I am not aware of any way for a file-based data source. For databases that support window functions, you could write a Custom SQL that would return a row number or index along with the data.
I could see adding the index prior to or after the union - as long as you could partition the data by filename you could get row number per file. Row-number as metadata would be great too! I'd love to see it exposed as it would allow a lot of flexibility.
Thank you for the discussion!
This is the one reason I cant switch form Power BI to Tableau, because Power BI handles this situation to well. I have a folder and every month I add more csv files to the folder. All I do is refresh and the it automatically merges and data into the new data set.