This is a great question. There are several ways you might approach it, but here's how I would do it.
As an example, here's a data set. It contains the sales of books for dates, but some of the values are missing Let's say I want to fill in missing values for the Copies Sold with the mean and the Highest Seller with the mode.
To do this, I'll use a couple of aggregation steps to find the median and mode:
I've grouped by title and may want to consider other groupings as make sense based on the data.
The Mode is going to take a bit more effort. I'll start by finding the number of records per Seller (and Title) value:
Then, I'll need to find out which Seller (per Title) had the highest number of rows (16 in this case). I can do this by first finding the MAX value:
And then joining back to the previous step where the MAX number equals the value. And you can see that I get a single row with the Seller being Amazon. That's my mode:
Then, I'll take both results and join them back to the original data set to include the median and mode values. You can see the Copies Sold from the aggregation contains the median value:
And the same with the Mode for Highest Seller:
Now, I can fill in the Null values in the original columns with the Median and Mode values (similar to the way discussed here: If data "x" is in one column, replace null in other column with "x" )
I can even use similar logic to create a status field that will let me know whether a value is actual or imputed and this may be helpful in future analysis:
I can even round the Copies Sold if desired (as I know I'm not going to sell 1/2 a book):
After doing something similar for Highest Seller and cleaning out some extra fields introduced by the join, I have a great data set with missing values filled in:
Hope that helps!
Missing Values.tflx 10.8 KB