Welcome to the Forums.
This is similar to the Utilization Problem.
Here is a link that may be helpful:
If it will work for your setup, you can cross join your dataset to a single column of all the possible days.
Then you can filter to just the pertinent values using:
The lookup date will be on your axis.
Here are further links:
The answer depends on the volume of data you need to process and the average length of stay. There are three possible solutions that all involve joining to a secondary table, but each one comes with a different computational overhead.
- If you have a small data set, such that Average Days of Stay × Number of Encounters < 1,000,000, you can temporally join each encounter to a table that has one record per day and then use simple aggregates to compute your statistics. This is the full densification solution. It is the most conceptually clear one to work with, but it has a very large computational overhead, as you are creating a daily census from every encounter. The join condition is [ADMIT DAY] <= [CENSUS DAY] AND [CENSUS DAY] <= [DISCHARGE DAY].
- If you have a larger data set, such that Average Months of Stay × Number of Encounters < 1,000,000, you can temporally join each encounter to a table that has one record per month, and then create a calculated column with the number of days of the encounter that coincided with the month. This solution is slightly more abstract, but it has a lower computational overhead, so it will run more efficiently on larger data sets. The join condition is [ADMIT DAY] <= [CENSUS MONTH END] AND [CENSUS MONTH START] <= [DISCHARGE DAY], and the calculated field is [CENSUS MONTH DAYS] := MIN([DISCHARGE DAY], [CENSUS MONTH END]) - MAX([ADMIT DAY], [CENSUS MONTH START]). The monthly statistics would then be based on this calculated field.
- The most intellectually challenging solution, but by a very large measure the most computationally efficient, is to work with a columnar event store. To construct a columnar event store you Cartesian join your data to a data set with just two rows: +1 for admit events, and -1 for discharge events. To calculate the monthly averages you then work with running sums, where admissions increment the sum and discharges decrement it. You have to take great care to keep track of month-to-month carryovers, so you will need calculations for new in month, discharged in month, and carried over from previous month. This last solution deserves a detailed post all its own, so I will set aside some time to generate some sample data, and then write up an article in the Calculation and Workbook library. Working with large columnar event stores is governed by The Notorious B.I.G.'s tenth commandment: "strictly for live men, not for freshmen"
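To make the monthly (second) option concrete, here is a small Python sketch of the [CENSUS MONTH DAYS] calculation with made-up dates, not real data. One subtlety the formula glosses over is inclusive versus exclusive endpoints; the sketch treats the month end as exclusive (the first day of the next month), which makes the monthly pieces sum exactly to the total length of stay across month boundaries.

```python
from datetime import date

# Hypothetical encounter dates. This mirrors
# MIN(discharge, month end) - MAX(admit, month start), with the month end
# taken as exclusive (the first day of the next month).
def census_month_days(admit, discharge, month_start, next_month_start):
    overlap = min(discharge, next_month_start) - max(admit, month_start)
    return max(0, overlap.days)

admit, discharge = date(2023, 1, 20), date(2023, 2, 10)
jan = census_month_days(admit, discharge, date(2023, 1, 1), date(2023, 2, 1))  # 12
feb = census_month_days(admit, discharge, date(2023, 2, 1), date(2023, 3, 1))  # 9
```

The monthly pieces (12 + 9) sum to the 21-day length of stay, which is the property you want before computing monthly statistics from this field.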
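And here is a minimal Python sketch (hypothetical encounters, not real data) of the third option: Cartesian join each encounter to a two-row {+1, -1} table to build the event store, then take a running sum in date order.

```python
from datetime import date

# Hypothetical encounters as (admit day, discharge day) pairs.
encounters = [
    (date(2023, 1, 5), date(2023, 2, 10)),
    (date(2023, 1, 20), date(2023, 1, 25)),
    (date(2023, 2, 1), date(2023, 3, 3)),
]

# Cartesian join each encounter to two rows: a +1 admit event and a
# -1 discharge event.
events = []
for admit, discharge in encounters:
    events.append((admit, +1))       # admission increments the census
    events.append((discharge, -1))   # discharge decrements the census

# A running sum over the events, in date order, gives the census after
# each event.
events.sort()
census = 0
census_after = {}
for day, delta in events:
    census += delta
    census_after[day] = census

# Month-to-month carryover: the census at the start of February is the
# running sum of all events strictly before February 1.
feb_carryover = sum(delta for day, delta in events if day < date(2023, 2, 1))
```

The carryover sum is what the "carried over from previous month" calculation has to track when you partition the running sums by month.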
This is just the starting point. Time series analysis of administrative data sets is a complex and difficult process; I have spent many hours mentoring analysts in this dark art. Let me know if you need more information.
The usual formula for computing ADC is total patient days / available days.
Total patient days is the sum of the lengths of stay, i.e., the difference between the discharge and admit dates, or the last day of the period if the patient has not been discharged yet.
Available days is the number of beds times days in your measurement period. You may need to join to another table to get the beds per hospital.
That is average daily occupancy per bed, whereas average daily census is the average number of patients per day in the month.
The difference is best understood using dimensional analysis. In both measures the day units in the numerator and denominator cancel. The first measure has units of patients per bed, and is usually expressed as percentage occupancy. The second measure has units of patients, and is best stated as an absolute magnitude.
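A quick worked example of that dimensional analysis, with made-up numbers:

```python
# Made-up month: 30 days, 20 beds, 480 total patient days.
patient_days = 480
days_in_month = 30
beds = 20

# Patient-days / days: the day units cancel, leaving patients.
average_daily_census = patient_days / days_in_month        # 16.0 patients

# Patient-days / bed-days: units of patients per bed, usually shown
# as a percentage occupancy.
occupancy = patient_days / (beds * days_in_month)          # 0.8, i.e. 80%
```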
I'm the author of the Utilization, Queues, Census, Throughput for Blocks, Floors, Units, Rooms, Offices, etc. - What do you do? post that links a number of threads together on this topic. Like @Aaron I've also spent some quality time on this; most recently I've been documenting how to solve this kind of problem (and a further variation, the moving count distinct problem) using Tableau Prep, and I'll link to that when it's up. I'm also tagging Bethany Lyons on this because it's her TC Europe presentation on doing time series analysis in Tableau that inspired me to tackle this problem.
Aaron Sheldon - your comments are great, I've got a few extra notes regarding Tableau that I'll number in order of your comments to match them up:
1. The Average Days of Stay × Number of Encounters < 1,000,000 suggestion can be too low, IMO. Particularly when using extracts, that number can potentially be at least an order of magnitude larger (i.e., 10M) on a modern laptop, due to Tableau's performance and the level of compression. Some other potentially mitigating factors, besides the type of data source and its performance, include the complexity of the desired visualization, the number of marks, and the rendering performance. For example, I've worked on executive dashboards that need sub-5-second render times, and for those we've had to do more pre-computation, which leads to the next point.
2. I really agree with this point, particularly for larger data sets or situations where we need to have faster render times. My approach to explaining this is that rather than getting the data at the finest grain (for example, one record for each patient per day in bed), we're considering the level of detail of our analysis and redefining the data at that grain, for example, one record per patient per month with their # of days in the month. My Tableau comment here for new users is that we won't end up with "the mother of all data sources" that can answer every analytical question; instead we'll have different data sources for answering different questions, like a padded-out data source that can show the census and an encounter-level source that can be used to show # of discharges per disease state per time period. This is where a data preparation tool like Tableau Prep, Alteryx, a Jupyter notebook, etc. is really helpful, because in those tools we can set up a pipeline such that common operations (like cleansing dates) are efficiently performed once in the pipeline, and then the pipeline separates to generate the desired outputs.
3. I've got a couple of other terms that I use to describe this kind of data set - "transactional" or "event" where the grain of the data is an event that is an admit or discharge. With the +1/-1 for each event we can build running totals in Tableau extremely easily and partition them at the day/week/month/etc. with just a couple of clicks. That all works brilliantly, however there are several challenges that can come up:
a) The first is with filtering...running totals are performed as table calculations and those are "late" calculations so the only way to filter them is with additional table calculations, which adds complexity.
b) Another aspect of table calculations is that they (with the exception of TOTAL()) compute over the marks in the view, not the underlying data so the viz level of detail (the dimensions in the view) has to support the calculations we want. Where this can run into a performance issue is if the unit of time is relatively small compared to the window of time, for example if I've got 20 hospital units and I'm looking at patterns of hourly census for a year that's 20 * 24 * 365 = 175,200 raw marks (which might get aggregated over to be a smaller set of displayed marks like the average hourly census) which isn't necessarily too much but I'd have to load in earlier time periods to get all the patients who were in-house at the beginning of the year and that could make the view too slow.
c) If the data is sparse (i.e., not necessarily an event for every unit of time we want to analyze) and we want to display a mark for every period (or aggregate across periods, like a weekday average of the hourly census), then we have to pad out the data at that level. The Step Lines feature introduced in 2018.1 can be a big help here for display-only views because it's effectively "padding out" or "extending" the line to the next event. Sometimes we can do the padding using Tableau's data densification behaviors; other times that just adds too much complexity and we need to do the padding in data prep.
Given these challenges, sometimes I'll skip using the event-based data and instead use the methods from #1 or #2. There are certainly other ways besides the running sum to do this and pre-compute results; however, those can run into challenges because, as Tableau is explicitly an interactive tool for data visualization, users often want parameterization (whether through Tableau filters or parameters), and adding those variables to the necessary pre-computation can add so much complexity that going back to the "brute force" methods of #1 and #2 can be easier.
I agree on many of the points above, and this is an ongoing discussion internally. I'm still looking for, and open to, better ways to do this.
My current approach is to calculate the census in the database by joining to each hour of the day. My trick to improve performance is to limit the size of the join to one week (or one month) at a time on both the date table and the patient table ... and then union the results. There is a computational cost to chunking the data, but it is far outweighed by the cost of the massive joins/searches it avoids. I have an Oracle procedure that does this for me. It creates temp tables, indexes them, joins them, updates the final table, and then drops the temp tables. It's not elegant, but it works.
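This isn't Stephen's Oracle procedure, just a pure-Python sketch of the chunking idea with hypothetical stays: the join of stays to hours is done one week at a time, and the per-chunk results are unioned at the end.

```python
from datetime import datetime, timedelta

# Hypothetical stays as (arrive, depart) datetimes; not real data.
stays = [
    (datetime(2023, 1, 2, 14), datetime(2023, 1, 3, 6)),
    (datetime(2023, 1, 9, 8), datetime(2023, 1, 9, 20)),
]

def hourly_census_chunked(stays, start, end, chunk=timedelta(days=7)):
    """Join stays to hours one chunk at a time, then union the results.
    A stay spanning a chunk boundary joins to both chunks, but each chunk
    only produces its own hours, so nothing is double counted."""
    census = {}
    lo = start
    while lo < end:
        hi = min(lo + chunk, end)
        # Only stays overlapping this chunk take part in the join,
        # which is what keeps each join small.
        chunk_stays = [s for s in stays if s[0] < hi and s[1] > lo]
        t = lo
        while t < hi:
            census[t] = sum(1 for arrive, depart in chunk_stays
                            if arrive <= t < depart)
            t += timedelta(hours=1)
        lo = hi
    return census

census = hourly_census_chunked(stays, datetime(2023, 1, 1), datetime(2023, 1, 15))
```

Because the chunks are independent, each one could also be dispatched in parallel, as noted later in the thread.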
I've been able to reduce computation from 7 days to 2 hours with this method. I do this for about 50 different census tables and all of them are hourly based. In all I have about a billion rows of census data ... which is in part because the procedure works so well that I can create any census I want with little effort. I have no significant issue loading 50 million rows into Tableau for analysis.
While I wish I had one "master" census table, I have found it useful to make varying levels of aggregation of these tables (hourly/daily/monthly/bed/unit/site/city/waiting room/treatment/boarding/isolation/attending physician/off service...), the combinations are endless.
I would like to know if anyone with access to in-memory/columnar/MPP systems does it differently, given the advances in the last couple of years.
The problem of generating census data from stays is part of the larger class of problems of finding intersecting intervals across multiple data sets. These problems are a tidy little puzzle in computational efficiency. The brute force solution, of comparing every possible combination of intervals from each data set, is roughly exponential in the number of data sets.
However, there is a log-linear (n log n) time solution in the total number of intervals across all data sets.
So far this solution falls well outside the domain of every database system and visualization product I have seen. Without giving too much intellectual property away, before I have a chance to GPL a Julia prototype to GitHub, the solution involves the in memory construction of something similar to a priority queue.
The mathematical reason why all our current data systems struggle to join on intervals is that all the join techniques depend on some form of binary order comparison to work. However, in 2 dimensions, such as the 2 required to define the start and end of an interval, one cannot define a binary comparison that also preserves the standard real topology defined by Euclidean distance, and hence preserves our traditional sense of probability, in the Borel or Lebesgue sense of measure. So, for example, one could use lexicographic ordering of intervals as the binary comparison, but that would neither find intersections nor preserve the standard topology. The other way to think of this is that intersection is not a transitive comparison, whereas all current database joins heavily leverage the transitive property of binary comparisons to reduce computational overhead.
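This isn't the Julia prototype mentioned above, just a standard sweep-line sketch in Python of the log-linear family of solutions: sort the interval start points once, keep the currently active intervals in min-heaps keyed on their end points, and report overlaps as the sweep advances.

```python
import heapq

def intersecting_pairs(a, b):
    """Find intersecting pairs between interval lists a and b (closed
    intervals as (lo, hi) tuples) with a sweep line. Each interval is
    pushed and popped at most once, so the cost is roughly
    O((n + k) log n) for n intervals and k output pairs."""
    events = [(lo, hi, 'a', i) for i, (lo, hi) in enumerate(a)]
    events += [(lo, hi, 'b', i) for i, (lo, hi) in enumerate(b)]
    events.sort()
    active = {'a': [], 'b': []}   # min-heaps of (end, index)
    pairs = []
    for lo, hi, side, i in events:
        other = 'b' if side == 'a' else 'a'
        # Drop intervals on the other side that ended before this one starts.
        while active[other] and active[other][0][0] < lo:
            heapq.heappop(active[other])
        # Everything still active on the other side overlaps [lo, hi].
        for _, j in active[other]:
            pairs.append((i, j) if side == 'a' else (j, i))
        heapq.heappush(active[side], (hi, i))
    return pairs
```

Note that the heap plays the role of the priority-queue-like structure described above: the binary comparison is only ever applied to the sorted start points and the heap of end points, never to whole intervals.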
I agree the issue is "comparing every possible combination of intervals from each data set" ... when we do a brute force/Cartesian join. But since the vast majority of joins fail to match, we can reduce the computation required by preventing them in the first place. I do it by partitioning the data into smaller chunks and looping through them.
For a typical Emergency Department with 200 visits per day, the number of join comparisons for a one-year hourly census is 200 × 24 × 365² ≈ 639 million. If we instead do one week at a time, the number of comparisons is 200 × 24 × 7² = 235,200 per week, × 52 weeks ≈ 12 million. There is a computational cost to creating and merging each week of data, but it is smaller. I could also run the 52 weeks in parallel.
There is a limit to the divide and conquer strategy: once the divisions become smaller than the interval of stay, you end up increasing the workload. Your method is very good, because it demonstrates the power of carefully considering trade-offs. By working at the week level you minimize the number of stays that span multiple search domains, while still getting good results from a divide and conquer strategy. If you divided by the hour, you would be back where you started. If you want to get really slick, you could optimize the strategy by tuning the division of the search space. But given you are satisfied with the weekly division, further optimization is probably more hassle than it is worth.
Your divide and conquer strategy is very similar to bounding box queries in GIS data sets, where a minimum assertion that must be true of all intersections is used to narrow the search space, even though the assertion in and of itself does not guarantee intersection.
Good morning to all who answered my post,
The overall response to my questions is very refreshing since so many want to help! Since this is my first post, I was not prepared for the response from this community.
Since I am a beginner and am trying to keep things as simple as possible, I have created two different kinds of Average Daily Census calculations. One is a monthly ADC from patient days, which only works after the month is over and is used for reporting monthly statistics. For ADC, I am using a charge detail database that has every charge for every patient, where the dates are driven by service date rather than admit or discharge date, so I don't have to create a discharge-minus-admit formula. For average daily census, we want to know the average number of patients in inpatient beds for that month, and this database gives me patient days for the month rather than discharge days or admit days.
When using the charge detail database, I have a filter using UB revenue codes to include only the UB codes used for inpatient day charges. You only get one UB Rev code charge in this range per day so I sum the units of these codes for the month.
In the formula below, Number of Units is the sum of the number of records with the UB Rev codes that account for a day charge. I calculate the number of days in the service month and just divide the total units by that month's total days. I have audited this number and it works. The key is that you cannot use this formula until the month is over, as it divides by the total days in the month.
Monthly ADC formula:
([Number Of Units]) / DATEDIFF('day', DATETRUNC('month', [Service Date]), DATEADD('month', 1, DATETRUNC('month', [Service Date])))
I also have another formula that I can use to calculate the average daily census through the last service date in my database, part-way through the month. This formula works for hospitals that have at least one patient in an inpatient bed every day. All of my hospitals have a census every day, with the exception of one rural hospital, so I am using this formula for all hospitals except that one.
First, I created a calculated field called Days in Service Month, which counts every distinct service day. I think this formula will count the days in whatever way you have configured your filters, table, or graph (by month, week, quarter, etc.). I haven't fully audited this formula yet, so if anyone sees an issue with it, let me know.
Calculated field 1- Name- Days in Service Month - Formula -COUNTD([Service Date])
Calculated field 2 - Name- Daily ADC from Patient Days- Sum([Number Of Units])/ [Days in Service Month]
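For anyone who wants to sanity-check these two calculated fields outside Tableau, here is a small Python equivalent with made-up charge rows (not Elona's data):

```python
from datetime import date

# Made-up charge-detail rows, already filtered to the inpatient-day
# UB revenue codes: (service date, number of units).
rows = [
    (date(2023, 6, 1), 12),
    (date(2023, 6, 1), 3),
    (date(2023, 6, 2), 14),
]

# Calculated field 1: Days in Service Month = COUNTD([Service Date])
days_in_service_month = len({day for day, _ in rows})                  # 2

# Calculated field 2: SUM([Number Of Units]) / [Days in Service Month]
daily_adc = sum(units for _, units in rows) / days_in_service_month    # 14.5
```

The COUNTD behavior is also where the one-hospital problem comes from: a day with no charge rows simply never appears in the distinct count, so the denominator silently shrinks.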
I qualify on Service Date Year (that I have customized to coincide with our Fiscal Year), Service Months, Facility, and UB Revenue Code.
So, I think the only issue I have is with the one hospital that may not have a patient in-house every day. I haven't come up with a simple solution yet, but I am convinced there is one. Thanks for everyone's help. Elona
The simplest, although not very elegant, solution we have used in situations where a location may not have a patient in every time unit is to add dummy fill records for those combinations of locations and time units that don't have patients. The catch is that in Tableau, you have to remember to count those dummy records as zero patients.
The generation of fill records can be done at the SQL level by generating a dummy table of all combinations of locations and time units and then left joining back to the source data, ignoring the dummy data when there is source data. This works okay if there is not an extensive or complex amount of filtering. Unfortunately, the solution can grow combinatorially, as the product of the number of values in all the fields you are filtering on, because you would need a dummy record for every gap in the combination of location, time unit, and filtered value.
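A tiny Python sketch of the fill-record idea (hypothetical data): cross join all locations × days into a dummy table, then "left join" back to the sparse source, counting the fill rows as zero patients.

```python
from datetime import date, timedelta
from itertools import product

# Hypothetical sparse source: (location, day) -> patient count.
# Note that Jan 2 is missing entirely, the gap we need to fill.
source = {("ICU", date(2023, 1, 1)): 4, ("ICU", date(2023, 1, 3)): 5}

locations = ["ICU"]
days = [date(2023, 1, 1) + timedelta(days=i) for i in range(3)]

# Dummy table of every location x day combination, left joined back to
# the source; combinations with no source row become zero patients.
padded = {(loc, day): source.get((loc, day), 0)
          for loc, day in product(locations, days)}
```

Every added filter dimension multiplies the size of the `product(...)` cross join, which is the combinatorial growth described above.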
The heart of the problem is that when there is no data, Tableau does not know how much or where to fill in zeros. From a programming perspective this is a sensible thing to do, because asking Tableau to essentially guess where zeros belong would introduce more problems than solutions. The workaround for the need to "tell" Tableau where zeros belong is addressed extensively in the topics of "data padding in Tableau" and "data densification in Tableau", where through the choice of the chart type, and critically the table calculation, you implicitly "tell" Tableau where to put zeros. The reason the table calculation is critical is that in specifying a table calculation you have to choose the fields the calculation addresses, and this in turn specifies an ordering and granularity that Tableau can pad.
My hope is that in the long term data padding becomes more formally supported, as compared to the current ad hoc mechanisms of bins and table calculations. Ideally this would be through the ability to create special padded fields from a source field, much as one can create bins and groups right now.