Generic framework for Self service BI (part 1)
Requirement:
Like any web-based company, ours is an organisation that needs to analyse a lot of user behaviour. What users are doing on our platform, and how they are interacting with it, is of immense interest to us. These analyses can be based on different phases of the user journey: when users visit our web pages, when they download our application and explore it, when they register with us, and when they turn into paid customers.
We used to treat these analyses as separate from each other and maintained different systems for each of them. This required a large amount of resources and a directory full of reports and dashboards. This seems to be the case in a large number of organisations.
We wanted to create a system that would simplify the process enough to become a true self-service framework. Business teams should not have to go through the loop of raising tickets and waiting for resource allocation just to get answers to their queries.
Additionally, for the BI and analytics team, we wanted a system that was easy to maintain and that cut down the multiple systems with potentially conflicting business logic.
It was a big ask. In the following sections we walk through the development of the cohort analysis framework.
Tools:
We were already using Tableau and were confident of its capabilities, and business users found it intuitive to use. To implement our solution we focused on the customization options available and how each could be used-
Custom SQL/data source of the Tableau workbook:
The basis of any cohort analysis is the presence of comparable starting sets (A), their attributes (B) and the behaviour of these cohorts in the future (C). We implemented the same logical flow in the data source.
The first table we created, called the “base_table”, contains the definition of the cohorts. For example, if we are analysing two different landing pages, the table structure would look like this (a DDL sketch follows the column list)-
Column name: definition
long_cookie: unique identifier of new visitors on our landing page
landing_page: the page on which they landed (the cohort division)
visit_time: the time of visit
visit_date: the date of visit
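As a concrete illustration, a minimal DDL sketch for this table in Spark SQL could look like the following; the column types are our assumptions, not a definitive implementation.

```sql
-- Sketch of the cohort definition table; column types are assumptions.
CREATE TABLE IF NOT EXISTS base_table (
  long_cookie  STRING    COMMENT 'unique identifier of a new visitor',
  landing_page STRING    COMMENT 'page on which they landed (cohort division)',
  visit_time   TIMESTAMP COMMENT 'time of visit',
  visit_date   DATE      COMMENT 'date of visit'
);
```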
The second table we created is a massive pivoted-down table that contains all the events we would want to observe in the future for the above cohorts. The structure of this table would be as follows (again with a DDL sketch after the list)-
long_cookie: long cookie of the event (enriched through lookback if not directly available)
hardware_id: hardware id of the event, if applicable (enriched through lookback when applicable but not directly available)
user_id: user id, if applicable
event: name of the event (e.g. registration, gameplay, revenue)
event_value: can hold both discrete values, like registration, and continuous ones, like revenue
event_time: populated if we save events at timestamp level; null if we keep aggregated values at day level
event_details: an additional column to save any attribute frequently used with the metric event, e.g. add cash with addcash_amount, revenue with revenue_source, install with app_version
event_date: date of occurrence of the event
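A comparable sketch for this events table (again in Spark SQL; the types, and the numeric encoding of discrete events, are our assumptions) might be:

```sql
-- Sketch of the pivoted-down events table, referred to as metric_events below.
CREATE TABLE IF NOT EXISTS metric_events (
  long_cookie   STRING,
  hardware_id   STRING,     -- null when not applicable
  user_id       STRING,     -- null when not applicable
  event         STRING,     -- e.g. 'registration', 'gameplay', 'revenue'
  event_value   DOUBLE,     -- 1 for discrete events, an amount for continuous ones
  event_time    TIMESTAMP,  -- null when events are aggregated at day level
  event_details STRING,     -- e.g. addcash_amount, revenue_source, app_version
  event_date    DATE
);
```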
The third table, or tables, follow from the modular design of the framework, capitalizing on the fact that the base table and event values have been saved at a granular level. With long_cookie, hardware_id and user_id already present, any additional information required for detailed analysis can simply be joined in. With any luck, an existing data warehouse will already have tables storing this information. If not, these dimension tables can be created, with the constraint that they must not contain duplicate values of the column on which they join, otherwise the data will get multiplied.
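For example, a detailed analysis could pull an attribute from a user dimension table along these lines; user_dim and its country column are hypothetical names, and the inner query is just one way to guarantee a single row per join key.

```sql
-- Join cohort and event data to a dimension table without multiplying rows.
SELECT
  b.landing_page,
  u.country,                                -- illustrative attribute from the dimension table
  SUM(e.event_value) AS total_value
FROM base_table b
JOIN metric_events e
  ON e.long_cookie = b.long_cookie
LEFT JOIN (
  SELECT user_id, MAX(country) AS country   -- enforce one row per user_id
  FROM user_dim
  GROUP BY user_id
) u ON u.user_id = e.user_id
GROUP BY b.landing_page, u.country;
```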
Metrics and Attributes:
The use of parameters gives us the opportunity to handle customizations the end users might ask for later. One of the cases we wanted to handle was Dn metrics, i.e. measuring behaviour up to a certain number of days after inclusion in the cohort.
Since our metrics are present in a pivoted-down format, they can all be treated the same way. By using parameters for the date difference and for which metric to include, we can create a set of generic, replicable metrics, as sketched below.
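The logic behind a Dn metric, written here as Spark SQL rather than our exact Tableau calculation, could look roughly like this; the :dn_days and :selected_event placeholders stand in for the Tableau parameters.

```sql
-- Count cohort members who performed the selected event within n days of
-- entering the cohort, split by the cohort division.
SELECT
  b.landing_page,
  COUNT(DISTINCT b.long_cookie) AS dn_users
FROM base_table b
JOIN metric_events e
  ON  e.long_cookie = b.long_cookie
  AND e.event = :selected_event
  AND DATEDIFF(e.event_date, b.visit_date) BETWEEN 0 AND :dn_days
GROUP BY b.landing_page;
```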
Visualizations:
We have all faced the situation where more metrics need to be added to an analysis and the development needs to be redone. However, since we have treated all metrics in the same way, and our metrics are generic, we end up with visualizations that can be generic too. Through a dropdown, a metric can be selected, and the visualisation will start reflecting it. Hint: if a new metric needs to be added, we only need to add it to the metric_events table and it becomes available in the dropdown immediately, as illustrated below.
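To make this concrete, adding a hypothetical new metric could be as simple as loading it into metric_events; the event name and the source table below are illustrative, not part of our actual pipeline.

```sql
-- Load a new event into metric_events; it then appears in the dashboard
-- dropdown without any change to the workbook.
INSERT INTO metric_events
SELECT
  long_cookie,
  hardware_id,
  user_id,
  'tutorial_complete'        AS event,
  1                          AS event_value,
  completed_at               AS event_time,
  app_version                AS event_details,
  CAST(completed_at AS DATE) AS event_date
FROM tutorial_completions;
```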
To implement the back-end solution (pivoted-down tables, lookup tables, dimension tables, etc.), we are using Databricks for data processing and S3 for storage. For connecting to Tableau, we are currently using Presto.
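Under this stack, the earlier metric_events sketch would additionally carry its storage details, along these lines; the file format, partitioning column and S3 path are all assumptions for illustration.

```sql
-- metric_events declared as a Parquet table on S3, partitioned by day, so
-- that Databricks can write it and Presto can read the same location.
CREATE TABLE IF NOT EXISTS metric_events (
  long_cookie   STRING,
  hardware_id   STRING,
  user_id       STRING,
  event         STRING,
  event_value   DOUBLE,
  event_time    TIMESTAMP,
  event_details STRING,
  event_date    DATE
)
USING PARQUET
PARTITIONED BY (event_date)
LOCATION 's3://our-analytics-bucket/metric_events/';
```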
Methodology:
The practical workflow we have developed around the framework is as important as the technical aspects of the solution. We have built a base dashboard, simply called the vanilla dashboard. It comes with the basic connections to the backend tables and a set of standardised visualizations. For any new analysis, a new copy of the vanilla dashboard is created in Tableau Desktop and the end user builds the required analysis by creating additional views. BI is involved only if new data needs to be connected to a data source or some advanced logic (like LODs) needs to be implemented.
This has provided immense flexibility to the end users, who do face an initial learning curve. It has also cut down on the QA requirements for data sharing, as the base data for all analyses remains the same. Instead of long scripts, the business logic lives on the dashboards themselves, cutting down the time needed for cross-checking and resolving doubts.
Results:
Within three months, tickets being raised with us had been eliminated entirely. Stakeholders no longer had to wait for data and had much greater freedom to ask more questions of the data by themselves.
Based on the learnings from this initial framework, we went on to try a few more changes to make it even more versatile-
Making the cohort definition table pivoted down too.
Driving the selection of grain and the join column through a parameter on the dashboard. This was a very interesting challenge, and one not explored in many solutions.
Integrating R with Tableau to provide statistical significance testing for both discrete and continuous metrics.
The integration of these features will be covered in part 2.