How about moving the dataset to another, dedicated instance of SQL Server? Instead of producing an extract, you could create a process to host this data set separately. In that case, even an average server should be able to handle 100M rows. If you guys are okay with storing your data in the cloud, Amazon Redshift would be awesome for this and quite fast. In both cases you would have a table with Student ID as the primary key and the 100+ metrics as columns.
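A rough sketch of what that table might look like; the table and column names here are made up for illustration, so swap in your real metric names and types:

```sql
-- Hypothetical wide table: one row per student, one column per metric.
-- StudentMetrics, MetricNNN, and LoadDate are placeholder names.
CREATE TABLE StudentMetrics (
    StudentID   VARCHAR(20)    NOT NULL PRIMARY KEY,
    Metric001   DECIMAL(18, 4) NULL,
    Metric002   DECIMAL(18, 4) NULL,
    -- ... one column per metric, out to 100+
    LoadDate    DATE           NOT NULL
);
```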
I'm looking for general ideas/approaches for querying such a large amount of data while keeping the tool fast for users.
I would create an extract and, if possible, hide/remove any redundant fields (reduces size), and if possible pre-aggregate the data (again to save space, by reducing the row count and hopefully thereby improving performance). By pre-aggregate I mean roll day-level data up to week, month, or year level.
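A minimal sketch of that roll-up, assuming a hypothetical DailyScores table (all names here are made up):

```sql
-- Roll hypothetical day-level rows up to one row per student per month.
-- DailyScores, ActivityDate, and Score are placeholder names.
SELECT
    StudentID,
    DATEFROMPARTS(YEAR(ActivityDate), MONTH(ActivityDate), 1) AS ActivityMonth,
    SUM(Score) AS TotalScore,
    COUNT(*)   AS DaysActive
INTO MonthlyScores   -- materialize the smaller, pre-aggregated table
FROM DailyScores
GROUP BY StudentID,
         DATEFROMPARTS(YEAR(ActivityDate), MONTH(ActivityDate), 1);
```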
Also (and I do this where possible), swap out long strings for short numeric values. If your student IDs look like ABC123456789 and you don't need to identify each student, just swap them all to 1; you'll still get the counts, volumes, etc., but it will run faster and be smaller.
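For example (table and column names are hypothetical), you can collapse the ID entirely when identity doesn't matter, or map each string to a small integer surrogate when you still need per-student grain:

```sql
-- Option A: identity not needed. Replace every long string ID with the
-- constant 1; row counts and aggregates are unchanged.
SELECT 1 AS StudentKey, Metric001, Metric002
INTO SlimExtract
FROM StudentMetrics;

-- Option B: per-student grain still needed. Map each string ID to a
-- small integer surrogate key instead.
SELECT DENSE_RANK() OVER (ORDER BY StudentID) AS StudentKey,
       Metric001, Metric002
INTO SlimExtractKeyed
FROM StudentMetrics;
```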
Along the same theme: if you are able to split/reduce the data further, by splitting into year, area, state, country, etc., it would obviously benefit performance.
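A trivial sketch of that splitting idea, again with made-up names, carving one year's rows off into their own smaller table/extract:

```sql
-- Carve the 2023 rows into their own table so each extract stays small.
-- StudentMetrics and LoadDate are placeholder names from above.
SELECT *
INTO StudentMetrics_2023
FROM StudentMetrics
WHERE YEAR(LoadDate) = 2023;
```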
Using the Pages shelf could also help with display/usage.
Pre-calculating some of the metrics, rather than calculating them on the fly, will also help.
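For instance (hypothetical columns), a persisted computed column in SQL Server stores the derived value at write time instead of computing it per query:

```sql
-- Store a derived metric at write time rather than computing it on the fly.
-- PassedCount and AttemptedCount are hypothetical source columns.
ALTER TABLE StudentMetrics
ADD PassRate AS (CAST(PassedCount AS DECIMAL(9, 4))
                 / NULLIF(AttemptedCount, 0)) PERSISTED;
```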
Just a few ideas
Thanks for the suggestion! This is something I will investigate with my team.
Thanks for the help! Unfortunately, while I appreciate the idea, our leadership team would not like it aggregated (it needs to stay at a per-student level) and doesn't want to remove student-identifying information. Rolling it up (to something weekly, monthly, etc.) isn't relevant to our data set, but again, I appreciate the suggestions. It's always good to look at it through another paradigm!
OK, so here's my take on your quandary and what to look for.
1. Do you really need all 100 metrics or can you scale down the list?
2. Assuming the answer is no, I initially lean toward the extract approach. How the data is structured internally depends not so much on the number of rows (though that is certainly a factor) as on the cardinality and sparsity of the data. In other words, if there are common values across rows, or many of the measures are missing across the population, then these will either be compressed or take minimal space. So I advise doing some up-front data profiling to estimate performance; see the profiling sketch after this list.
3. Get some fast disk in your server hardware: either SSD or, failing that, RAID-10.
4. Make sure that all your metrics are numeric. If they are, they will take less space and therefore render faster.
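To put #2 into practice, a quick profiling query along these lines (column names hypothetical, reusing the StudentMetrics sketch above) gives you cardinality and sparsity per metric before you commit to the extract:

```sql
-- Quick data-profiling pass: cardinality and NULL rate for a few metrics.
-- Low distinct counts and high NULL rates both compress well in an extract.
SELECT
    COUNT(*)                                AS TotalRows,
    COUNT(DISTINCT Metric001)               AS Metric001_Cardinality,
    1.0 - COUNT(Metric001) * 1.0 / COUNT(*) AS Metric001_NullRate,
    COUNT(DISTINCT Metric002)               AS Metric002_Cardinality,
    1.0 - COUNT(Metric002) * 1.0 / COUNT(*) AS Metric002_NullRate
FROM StudentMetrics;
```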
Have you tried it yet with the entire dataset? How did it turn out?