I’ve been very interested in the possibility of building more useful applications by analyzing how users are moving through the applications we build, how they interact with them, and how that changes as the application evolves.
Over the past few weeks I’ve been trying to build a system for analyzing user behavior through my streaming Snowplow Analytics stack.
It’s been pretty tough.
This is something I’ve wanted to do for a while. But I’ve always been overwhelmed by the sheer complexity of managing and manipulating the large streams of data.
How can I handle the large volumes of data in an efficient way? What if I lose some data? What if I duplicate the data? What if these issues lead my coworkers to draw false conclusions about our users?
These are all valid questions, and I am probably not the best-equipped person in the world to answer them.
But we all have to start somewhere. We can only spend so much time lying around, waiting for someone else to build our dream system. There's no better time than the present to get our hands dirty, learn a thing or two, and do something interesting.
Plus, it’s fun.
Here, I’ve written down some of the issues that have come up and how I’ve been solving them so far.
The analytics stack that I’m working with runs in real time through a Kinesis stream, using the streaming Snowplow Analytics setup that I have described how to build.
I’ve started a GitHub project that uses all of these techniques. It is not currently well documented or easy to use, but if it proves useful, I will document and test it more thoroughly in the future!
One thing that I’ve been wanting to build is a real-time user segmentation system and user funnel tracking system.
This means that the order of events matters. If we want to see a list of people who’ve read Part 1 – The Snowplow Collector and then read Part 2 – The Snowplow Stream Enrichment Process, we need to make sure that the events are processed in the correct order.
We want to prevent this scenario:
If the order could be mixed up, the funnels would be incorrect and we could draw the wrong conclusions about how people are flowing through the website.
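To make the ordering problem concrete, here is a minimal Python sketch of a funnel check. It is not the code from my project: the function name `completed_funnel` and the page identifiers are made up for illustration, and it assumes each user's page views have already been collected into a list in the order they were processed.

```python
from typing import Sequence


def completed_funnel(pages_in_order: Sequence[str], funnel: Sequence[str]) -> bool:
    """Return True if every funnel step appears in order (not necessarily adjacent)."""
    step = 0
    for page in pages_in_order:
        # Only advance when the page matches the next expected funnel step.
        if step < len(funnel) and page == funnel[step]:
            step += 1
    return step == len(funnel)


# Hypothetical page identifiers for the two posts in the funnel.
FUNNEL = ["part-1-collector", "part-2-stream-enrichment"]

# The same two page views, processed in the right order and in the wrong order.
in_order = ["part-1-collector", "part-2-stream-enrichment"]
swapped = ["part-2-stream-enrichment", "part-1-collector"]

in_order_result = completed_funnel(in_order, FUNNEL)
swapped_result = completed_funnel(swapped, FUNNEL)
```

The reader did exactly the same thing in both cases, but only the correctly ordered list counts as a completed funnel. The swapped list silently drops that user from the funnel.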
The first thing to understand is that Kinesis has at-least-once delivery guarantees. This means that, if a message makes it to Kinesis, we will always receive it at least one time.
It also means that we might receive the same message more than once.
This affects us in an interesting way:
- Data can be duplicated
- Event order can be messed up through data duplication
This is what that might look like:
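As a rough sketch in Python, with a hard-coded list standing in for what a consumer might receive, a single redelivered event is enough to change the apparent order. The `event_id` and `page` keys are assumptions about the event shape for this example, and `drop_duplicates` is a hypothetical helper showing one common remedy (keep the first copy of each ID), not necessarily what my project does.

```python
from typing import Dict, Iterable, List

Event = Dict[str, str]

# Simulated delivery: event "a" arrives, then "b", then "a" is delivered again.
delivered: List[Event] = [
    {"event_id": "a", "page": "part-1"},
    {"event_id": "b", "page": "part-2"},
    {"event_id": "a", "page": "part-1"},  # duplicate delivery of the first event
]

# Taken at face value, the user appears to go part-1 -> part-2 -> part-1.
pages_seen = [event["page"] for event in delivered]


def drop_duplicates(events: Iterable[Event]) -> List[Event]:
    """Keep only the first occurrence of each event_id, preserving arrival order."""
    seen_ids = set()
    unique = []
    for event in events:
        if event["event_id"] in seen_ids:
            continue  # already processed this event once
        seen_ids.add(event["event_id"])
        unique.append(event)
    return unique


deduplicated_pages = [event["page"] for event in drop_duplicates(delivered)]
```

Note that this sketch keeps every ID it has ever seen in memory, so a long-running stream would need some bound on that set, such as a time window.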