Welcome to the Forums.
This is similar to the Utilization Problem.
Here is a link that may be helpful:
If it will work for your setup, you can cross join your dataset to a single column of all the possible days.
Then you can filter to just the pertinent values; the lookup date will be on your axis.
Here are some further links:
The answer depends on the volume of data you need to process and the average length of stay. There are three possible solutions, all of which involve joining to a secondary table, but each comes with a different computational overhead.
- If you have a small data set, where Average Days of Stay x Number of Encounters < 1,000,000, you can temporally join each encounter to a table that has one record per day and then use simple aggregates to compute your statistics. This is the full densification solution. It is the most conceptually clear one to work with, but it has a very large computational overhead, as you are creating a daily census record for every day of every encounter. The join condition is [ADMIT DAY] <= [CENSUS DAY] <= [DISCHARGE DAY].
- If you have a larger data set, where Average Months of Stay x Number of Encounters < 1,000,000, you can temporally join each encounter to a table that has one record per month, and then create a calculated column with the number of days of the encounter that fell within the month. This solution is slightly more abstract, but it has a lower computational overhead, so it will run more efficiently on larger data sets. The join condition is [ADMIT DAY] <= [CENSUS MONTH END] AND [CENSUS MONTH START] <= [DISCHARGE DAY], and the calculated field is [CENSUS MONTH DAYS] := MIN([DISCHARGE DAY], [CENSUS MONTH END]) - MAX([ADMIT DAY], [CENSUS MONTH START]). The monthly statistics are then based on this calculated field.
- The most intellectually challenging solution, but by a very large measure the most computationally efficient, is to work with a columnar event store. To construct a columnar event store you Cartesian join your data to a data set with just two rows: +1 for admit events, and -1 for discharge events. To calculate the monthly averages you then work with running sums, where admissions increment the sum and discharges decrement it. You have to take great care to keep track of month-to-month carryovers, so you will need calculations for new in month, discharged in month, and carried over from the previous month. This last solution deserves a detailed post all its own, so I will set aside some time to generate some sample data and then write up an article in the Calculation and Workbook library. Working with large columnar event stores is governed by The Notorious B.I.G.'s tenth commandment: "strictly for live men, not for freshmen".
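To make the event-store idea concrete, here is a minimal Python sketch with made-up toy encounters (not production code, and not tied to any particular database): each encounter is expanded into a +1 admit event and a -1 discharge event, and replaying the sorted events with a running sum recovers the daily census.

```python
from datetime import date, timedelta

# Hypothetical toy encounters: (admit_day, discharge_day). Convention here:
# a patient counts toward the census from admit day up to, but not including,
# discharge day (i.e., census counts midnights).
encounters = [
    (date(2023, 1, 1), date(2023, 1, 5)),
    (date(2023, 1, 3), date(2023, 2, 2)),
    (date(2023, 1, 30), date(2023, 1, 31)),
]

# Columnar event store: one +1 event per admit, one -1 event per discharge.
events = []
for admit, discharge in encounters:
    events.append((admit, +1))
    events.append((discharge, -1))
events.sort()

# Running sum over the events reconstructs the census level day by day.
census = {}  # day -> patients in house at end of that day
level = 0
day = events[0][0]
last_day = max(e[0] for e in events)
i = 0
while day <= last_day:
    while i < len(events) and events[i][0] == day:
        level += events[i][1]
        i += 1
    census[day] = level
    day += timedelta(days=1)

print(census[date(2023, 1, 4)])  # 2 patients in house on Jan 4
```

The month-to-month carryover logic mentioned above would sit on top of this: the running level at a month boundary is exactly the "carried over from previous month" figure.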
This is just the starting point. Time series analysis of administrative data sets is a complex and difficult process; I have spent many hours mentoring analysts in this dark art. Let me know if you need more information.
The usual formula for computing ADC is total patient days / available days.
Total patient days is the sum of the lengths of stay, i.e., the difference between the admit and discharge dates, using the last day of the period for patients who have not been discharged yet.
Available days is the number of beds times days in your measurement period. You may need to join to another table to get the beds per hospital.
That is average daily occupancy per bed.
Whereas average daily census is the average number of patients per day in the month.
The difference is best understood using dimensional analysis. In both measures the day units in the numerator and denominator cancel. The first measure has units of patients per bed and is usually expressed as a percentage occupancy. The second measure has units of patients and is best stated as an absolute magnitude.
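A small worked example may make the dimensional difference concrete. This is a sketch with made-up numbers (a hypothetical 10-bed unit observed for June); it clips each stay to the measurement period as described above, then computes both measures from the same patient-day total:

```python
from datetime import date

# Hypothetical unit: 10 beds, measured over June 2023 (30 days).
beds = 10
period_start, period_end = date(2023, 6, 1), date(2023, 6, 30)
days_in_period = (period_end - period_start).days + 1  # 30

# (admit, discharge) pairs; None = not yet discharged, so count to period end.
stays = [
    (date(2023, 5, 28), date(2023, 6, 10)),
    (date(2023, 6, 5), date(2023, 6, 20)),
    (date(2023, 6, 25), None),
]

# Clip each stay to the measurement period before taking its length
# (counting midnights: discharge day itself is not a patient day).
patient_days = 0
for admit, discharge in stays:
    start = max(admit, period_start)
    end = min(discharge or period_end, period_end)
    patient_days += (end - start).days

census = patient_days / days_in_period               # units: patients
occupancy = patient_days / (beds * days_in_period)   # units: patients per bed

print(patient_days, census, occupancy)
```

With these toy stays, patient days come to 29, so the average daily census is about 0.97 patients, while occupancy is about 9.7% of the 10 beds: same numerator, different denominator units.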
I'm the author of the Utilization, Queues, Census, Throughput for Blocks, Floors, Units, Rooms, Offices, etc. - What do you do? post that links a number of threads together on this topic. Like @Aaron, I've also spent some quality time on this; most recently I've been documenting how to solve this kind of problem (and a further variation, the moving count distinct problem) using Tableau Prep, and I'll link to that when it's up. I'm also tagging Bethany Lyons on this because it's her TC Europe presentation on doing time series analysis in Tableau that inspired me to tackle this problem.
Aaron Sheldon - your comments are great. I've got a few extra notes regarding Tableau, numbered to match up with your comments:
1. The Average Days of Stay x Number of Encounters < 1,000,000 suggestion can be too low, IMO. Particularly when using extracts, that number can potentially be at least an order of magnitude larger (i.e. 10M) on a modern laptop, due to Tableau's performance and the level of compression. Other potentially mitigating factors, besides the type of data source and its performance, include the complexity of the desired visualization, the number of marks, and rendering performance. For example, I've worked on executive dashboards that need sub-5-second render times, and for those we've had to do more pre-computation, which leads to the next point.
2. I really agree with this point, particularly for larger data sets or situations where we need faster render times. My approach to explaining this is that rather than getting the data at the finest grain (for example, one record for each patient per day in bed), we're considering the level of detail of our analysis and redefining the data at that grain, for example one record per patient per month with their # of days in the month. My Tableau comment here for new users is that we won't end up with "the mother of all data sources" that can answer every analytical question; instead we'll have different data sources for answering different questions, like a padded-out data source that can show the census and an encounter-level source that can be used to show # of discharges per disease state per time period. This is where a data preparation tool like Tableau Prep, Alteryx, a Jupyter notebook, etc. is really helpful, because in those tools we can set up a pipeline such that common operations (like cleansing dates) are efficiently performed once, and then the pipeline separates to generate the desired outputs.
3. I've got a couple of other terms that I use to describe this kind of data set - "transactional" or "event" - where the grain of the data is an event, i.e. an admit or a discharge. With the +1/-1 for each event we can build running totals in Tableau extremely easily and partition them at the day/week/month/etc. level with just a couple of clicks. That all works brilliantly, however there are several challenges that can come up:
a) The first is with filtering: running totals are performed as table calculations, and those are "late" calculations, so the only way to filter them is with additional table calculations, which adds complexity.
b) Another aspect of table calculations is that (with the exception of TOTAL()) they compute over the marks in the view, not the underlying data, so the viz level of detail (the dimensions in the view) has to support the calculations we want. Where this can run into a performance issue is when the unit of time is relatively small compared to the window of time. For example, if I've got 20 hospital units and I'm looking at patterns of hourly census for a year, that's 20 * 24 * 365 = 175,200 raw marks (which might get aggregated to a smaller set of displayed marks, like the average hourly census). That isn't necessarily too much, but I'd have to load in earlier time periods to get all the patients who were in-house at the beginning of the year, and that could make the view too slow.
c) If the data is sparse (i.e. not necessarily an event for every unit of time we want to analyze) and we want to display a mark for every period (or aggregate across periods, like a weekday average of the hourly census), then we have to pad out the data at that level. The Step Lines feature introduced in 2018.1 can be a big help here for display-only views, because it effectively "pads out" or "extends" the line to the next event. Sometimes we can do the padding with Tableau's data densification behaviors; other times that just adds too much complexity and we need to do the padding in data prep.
Given these challenges, sometimes I'll skip using the event-based data and instead use the methods from #1 or #2. There are certainly other ways besides the running sum to do this and pre-compute results; however, those can run into challenges because, as Tableau is explicitly an interactive tool for data visualization, users often want parameterization (whether through Tableau filters or parameters), and adding those variables to the necessary pre-computation can add so much complexity that going back to the "brute force" methods of #1 and #2 can be easier.
I agree on many of the points above, and this is an ongoing discussion internally. I'm still looking for, and open to, better ways to do this.
My current approach is to calculate the census in the database by joining to each hour of the day. My trick to improve performance is to limit the size of the join to one week (or one month) at a time on both the date table and the patient table ... and then union the results. There is a computational cost to chunking the data, but it is outweighed by the savings from avoiding massive joins/searches. I have an Oracle procedure that does this for me. It creates temp tables, indexes them, joins them, updates the final table, and then drops the temp tables. It's not elegant, but it works.
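As a rough sketch of the chunk-join-union idea (toy Python with hypothetical stays, not the actual Oracle procedure): each weekly chunk joins only the stays that overlap it, and the per-chunk results are then unioned into one census.

```python
from datetime import datetime, timedelta

# Hypothetical toy stays: (arrive, depart) timestamps, half-open [arrive, depart).
stays = [
    (datetime(2023, 1, 2, 8), datetime(2023, 1, 2, 14)),
    (datetime(2023, 1, 2, 13), datetime(2023, 1, 9, 1)),
]

period_start = datetime(2023, 1, 1)
period_end = datetime(2023, 1, 15)

def hourly_census_chunked(stays, start, end, chunk=timedelta(days=7)):
    """Join stays to hours one chunk at a time, then union the chunks."""
    census = {}
    chunk_start = start
    while chunk_start < end:
        chunk_end = min(chunk_start + chunk, end)
        # Only stays overlapping this chunk participate in the join.
        local = [s for s in stays if s[0] < chunk_end and s[1] > chunk_start]
        hour = chunk_start
        while hour < chunk_end:
            census[hour] = sum(1 for a, d in local if a <= hour < d)
            hour += timedelta(hours=1)
        chunk_start = chunk_end
    return census

census = hourly_census_chunked(stays, period_start, period_end)
print(census[datetime(2023, 1, 2, 13)])  # both patients present: 2
```

The point of the `local` filter is the same as the temp tables: a stay is only compared against the hours of chunks it can actually overlap, instead of against every hour of the whole period.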
I've been able to reduce computation from 7 days to 2 hours with this method. I do this for about 50 different census tables and all of them are hourly based. In all I have about a billion rows of census data ... which is in part because the procedure works so well that I can create any census I want with little effort. I have no significant issue loading 50 million rows into Tableau for analysis.
While I wish I had one "master" census table, I have found it useful to make varying levels of aggregation of these tables (hourly/daily/monthly/bed/unit/site/city/waiting room/treatment/boarding/isolation/attending physician/off service...), the combinations are endless.
I would like to know if anyone with access to in-memory/columnar/MPP systems does it differently, given the advances in the last couple of years.
The problem of generating census data from stays is part of the larger class of problems of finding intersecting intervals across multiple data sets. These problems are a tidy little puzzle in computational efficiency. The brute force solution, of comparing every possible combination of intervals from each data set, is roughly exponential in the number of data sets.
However, there is a log-linear time solution, on the order of n log n in the total number of intervals across all data sets.
So far this solution falls well outside the domain of every database system and visualization product I have seen. Without giving too much intellectual property away before I have a chance to GPL a Julia prototype on GitHub, the solution involves the in-memory construction of something similar to a priority queue.
The mathematical reason why all our current data systems struggle to join on intervals is that all the join techniques depend on some form of binary order comparison to work. However, in 2 dimensions, such as the 2 required to define the start and end of an interval, one cannot define a binary comparison that also preserves the standard real topology defined by Euclidean distance, and hence preserves our traditional sense of probability, in the Borel or Lebesgue sense of measure. So, for example, one could use lexicographic ordering of intervals as the binary comparison, but that would neither find intersections nor preserve the standard topology. The other way to think of this is that intersection is not a transitive comparison, whereas all current database joins heavily leverage the transitive property of binary comparisons to reduce computational overhead.
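As a rough illustration of the priority-queue idea (a sketch only, assuming half-open intervals and toy data; this is my guess at the shape of the approach, not the Julia prototype): sort all interval endpoints, sweep through them, and keep a min-heap of active end points per set. Each interval is compared only against intervals still active, giving log-linear behavior instead of a Cartesian join.

```python
import heapq

# Two hypothetical sets of half-open intervals (start, end, label).
set_a = [(0, 5, "a1"), (3, 9, "a2")]
set_b = [(4, 6, "b1"), (10, 12, "b2")]

def intersections(xs, ys):
    """Report all cross-set intersecting pairs via a sweep over start points,
    with a min-heap of end points (a priority queue) per set."""
    events = sorted(
        [(s, e, tag, 0) for s, e, tag in xs] +
        [(s, e, tag, 1) for s, e, tag in ys]
    )
    active = [[], []]  # one heap of (end, tag) per set
    out = []
    for s, e, tag, side in events:
        other = active[1 - side]
        # Drop intervals from the other set that ended before this start;
        # they can never intersect anything that starts later.
        while other and other[0][0] <= s:
            heapq.heappop(other)
        # Everything still active in the other set intersects this interval.
        out.extend((tag, t) if side == 0 else (t, tag) for _, t in other)
        heapq.heappush(active[side], (e, tag))
    return sorted(out)

print(intersections(set_a, set_b))
```

Note the sweep never orders whole intervals against each other; it orders only single endpoints, which is exactly what sidesteps the two-dimensional comparison problem described above.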
I agree the issue is "comparing every possible combination of intervals from each data set" ... when we do a brute force/Cartesian join. But since the vast majority of joins fail to match, we can reduce the computation required by preventing them in the first place. I do it by partitioning the data into smaller chunks and looping through them.
For a typical Emergency Department with 200 visits per day, the number of join comparisons for a one-year hourly census is (200*365 encounters) x (24*365 hours) = 200*24*365^2 = 639 million. If we decide to do one week at a time, the number of comparisons is (200*7) x (24*7) = 200*24*7^2 = 235,200 per week ... x52 weeks = 12 million. There is a computational cost to creating and merging each week of data, but it is smaller. I could also run the 52 weeks in parallel.
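The arithmetic above can be sanity-checked in a few lines (just reproducing the quoted numbers for this hypothetical ED workload):

```python
# Hypothetical ED workload: 200 visits/day, hourly census over one year.
visits_per_day, hours_per_day, days = 200, 24, 365

# Brute force: every encounter compared against every hour of the year.
brute_force = (visits_per_day * days) * (hours_per_day * days)

# Chunked: one week's encounters against one week's hours, 52 times over.
weekly = (visits_per_day * 7) * (hours_per_day * 7)
chunked = weekly * 52

print(brute_force, weekly, chunked)  # 639480000 235200 12230400
```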
There is a limit to the divide and conquer strategy: once the divisions become smaller than the interval of stay, you end up increasing the workload. Your method is very good because it demonstrates the power of carefully considering trade-offs. By working at the week level you minimize the number of stays that span multiple search domains, while still getting good results from a divide and conquer strategy. If you divided by the hour, you would be back where you started. If you want to get really slick, you could optimize the strategy by tuning the division of the search space, but given you are satisfied with the weekly division, further optimization is probably more hassle than it is worth.
Your divide and conquer strategy is very similar to bounding box queries in GIS data sets, where a minimum assertion that must be true of all intersections is used to narrow the search space, even though the assertion in and of itself does not guarantee intersection.