4 Replies Latest reply on Nov 20, 2017 3:15 PM by Dinesh karem

    Aggregating an aggregate for histograms

    Johnathon Hege

      I'm hoping someone can help me out.  I work in the web analytics field and recently was given a task that has me scratching my head.


      I have a table, Visits, that contains one row for each visit someone made to our website.  It also contains a user_id field (anonymous hash showing who did the visiting) and a dt field (date of visit).  The brass wants the ability to answer this question: For any arbitrary date range, how many visitors had one visit, two visits, etc.  Preferably in the form of a histogram.


      Since the data is at a visit level, this requires two aggregations... One to aggregate number of visits per visitor, then another to aggregate visitors by visit bucket.  I used window_sum() to achieve this, creating a calculated field for each bucket, and made a working graph that gets the right numbers.  Performance, though, is a serious issue for longer time ranges.  We see about 1 million distinct user_ids a day, and I think having user_id on the level of detail shelf is killing me. The actual query executes in Vertica in less than 3 seconds, but I'm often left waiting for three or four minutes for the display to render. Any ideas?


      I've attached a sample workbook.  Performance is great, but again, this is a small fraction of our daily traffic.  Any suggestions for improved efficiency would be wonderful.