Simple yet Scalable Web Analytics: JSON in SQL with batch inserts
When building landing pages and blogs, we usually want to have some traffic data and analytics. Monitoring activity on our web pages turns out to be quite useful; it allows us to answer questions like:
- How many views do we get?
- How many unique visitors do we have?
- Do people just view our page and exit, or maybe they stay for longer?
- What browsers, operating systems and devices do our visitors use?
- How many times and by whom was our Call To Action acted upon?
Similarly, when we build web applications, we want to have analytical data to understand the behaviors and interactions of our users. Thanks to this, we can know things like:
- which features are used and are important, which are not
- where users spend the most amount of time and attention
- how they navigate between screens, do they get lost
- how often, when, and for how long they use our app
As it is a rather common and generic problem, there are many ready-to-use solutions on the market. But because:
- they always require some frontend work for the integration, in many cases no less than a custom solution would
- we need to send rather sensitive data to a third party - IP addresses, users' actions and behaviors; a custom solution is far better for privacy
- it can be expensive to use a managed solution or time-consuming to host and maintain it
- they are often not flexible enough
- this problem is not that complicated
and, what is more, I am always deeply curious about how things really work and see tremendous value in having this understanding - for these reasons, I have come up with my own solution: quite scalable, yet simple and universal. Let's explore it!
Solution
High-level overview
- There is a backend: the example implementation is in Java, but that is a secondary concern; it can be rewritten in any programming language
- Backend needs to have just one endpoint:
POST: /analytics/events
that accepts semi-structured events data; it means that there are a few required fields, like deviceId, browser, operatingSystem or device, but arbitrary data json, without any schema, can be sent as well
- Thanks to this approach, we have some common, required data for events but we can also add any additional, arbitrary data, depending on the event type
- Events are saved in the SQL database with solid JSON support (at least PostgreSQL, MySQL and SQLite have it)
- By using JSON in SQL, we can benefit from all its features: widely-known and supported query language, rich ecosystem of tools and libraries - while still having flexible schema for events data
- To make it highly scalable, with the ability to handle thousands of events per second and more, events are saved in batches, with a slight delay, not immediately as they come; the conditions are simple:
  - if there are >= N events in memory after an addition, save all events
  - every M milliseconds, check if there are any events in memory and, if so, save all of them
- In the example implementation, we mostly take device data from the userAgent property on the frontend (client) side, but it can be done on the backend as well
Detailed walkthrough
As an example is worth a thousand words, I have implemented and tested this approach in Java with Postgres.
Now, let's go over it.
We have an analytics event table, with some indexes:
CREATE TABLE analytics_event (
    timestamp TIMESTAMP NOT NULL,
    ip TEXT NOT NULL,
    device_id UUID NOT NULL,
    user_id UUID,
    url TEXT NOT NULL,
    browser TEXT NOT NULL,
    os TEXT NOT NULL,
    device TEXT NOT NULL,
    type TEXT NOT NULL,
    data JSONB
);
CREATE INDEX analytics_event_timestamp ON analytics_event(timestamp);
CREATE INDEX analytics_event_type_timestamp ON analytics_event(type, timestamp);
Explanation:
- timestamp - time when an event has occurred
- ip - IP address of a client
- device_id - unique id of the client's device; in our implementation, it is just a random UUID stored in the browser Local Storage
- user_id - id of the currently signed-in user or null for anonymous sessions/requests
- url - url of the current web page; can help with debugging or checking out path and/or query params
- browser - enum: Chrome, Firefox, Safari, Edge, Opera, etc.
- os - enum describing the operating system: Linux, Windows, Mac, Android, iOS, etc.
- device - enum describing the device type: Mobile, Tablet, Desktop
- type - enum describing the event type; it can be anything, for example: landing-page-view, landing-page-scrolled, sign-up-click, home-page-view, user-profile-view and so on
- data - arbitrary, optional piece of data that can have any schema; it represents additional details that might be useful for some events, such as:

{ "user-profile-id": "484099e3-e5bc-4e66-88d0-98be6e86a53b" }
{ "view-time": 1000 }
{ "search-input": "Some Gold" }
There is also an endpoint that accepts most of the table data:
POST: /analytics/events
Headers:
real-ip (optional): if we host it behind reverse proxy
Body:
{
    "deviceId": UUID,
    "url": String,
    "browser": String,
    "os": String,
    "device": String,
    "type": String,
    "data": Object (any type or null)
}
What is different here:
- userId is taken from the current user session/authToken; for obvious (security) reasons, only the backend can set it
- ip is either extracted from the request by a reverse proxy and set in the real-ip header, or the backend itself extracts it
- timestamp is set by the backend, when an event appears
Also, we perform some rudimentary validation, which is crucial, especially considering that this endpoint must be public. The checks performed here are very basic: whether required fields are null, empty, or too long - we do not want to be attacked with random events of arbitrary size. We should probably also check that there is no arbitrary JavaScript in these fields to avoid an XSS attack, if we visualize this data in a UI susceptible to one.
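To make this concrete, here is a minimal sketch of such checks in Java. It is an illustration, not the exact example implementation: the fields match the request body above, but the length limits and exception type are assumptions:

public class AnalyticsEventValidator {

    // limits are illustrative; pick values that fit your real events
    private static final int MAX_FIELD_LENGTH = 500;
    private static final int MAX_DATA_LENGTH = 5000;

    public void validate(String url, String browser, String os,
                         String device, String type, String dataJson) {
        requireNonEmptyWithMaxLength("url", url);
        requireNonEmptyWithMaxLength("browser", browser);
        requireNonEmptyWithMaxLength("os", os);
        requireNonEmptyWithMaxLength("device", device);
        requireNonEmptyWithMaxLength("type", type);
        // data is optional, but its size still needs to be capped
        if (dataJson != null && dataJson.length() > MAX_DATA_LENGTH) {
            throw new IllegalArgumentException("data is too large");
        }
    }

    private void requireNonEmptyWithMaxLength(String field, String value) {
        if (value == null || value.isBlank() || value.length() > MAX_FIELD_LENGTH) {
            throw new IllegalArgumentException(
                field + " is required and must have at most " + MAX_FIELD_LENGTH + " characters");
        }
    }
}

The deviceId does not appear here, since parsing it into a UUID validates its format and bounds its size by itself.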
When it comes to scalability, our bottleneck will be the SQL database first, server resources - second. How can we optimize db writes? The idea is very simple, yet highly effective:
- we have an in-memory toCreateEvents list - it has a maximum size N
- on a new event, we do not create it immediately but:
  - acquire a lock
  - add the event to the toCreateEvents list
  - check its size: if it is less than N, release the lock and return, else...
  - ...create N events in a single batch SQL insert query
  - empty the toCreateEvents list
  - release the lock
- we perform a similar check every M milliseconds: if the toCreateEvents list is not empty, insert all events and empty the list - it is done because the toCreateEvents list might never reach size N, and events should be available, more or less, in real time
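Here is a minimal sketch of this mechanism in Java. Names are illustrative rather than taken from the exact example implementation, and insertAll() stands for a single multi-row INSERT executed through JDBC or any other database access layer:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BatchingAnalyticsEventRepository {

    private static final int MAX_BATCH_SIZE = 1000;         // N
    private static final long FLUSH_INTERVAL_MILLIS = 5000; // M

    private final List<AnalyticsEvent> toCreateEvents = new ArrayList<>();
    private final Object lock = new Object();

    public BatchingAnalyticsEventRepository() {
        // every M milliseconds, insert whatever has accumulated so far
        Executors.newSingleThreadScheduledExecutor()
            .scheduleAtFixedRate(this::flush,
                FLUSH_INTERVAL_MILLIS, FLUSH_INTERVAL_MILLIS, TimeUnit.MILLISECONDS);
    }

    public void create(AnalyticsEvent event) {
        synchronized (lock) {
            toCreateEvents.add(event);
            if (toCreateEvents.size() < MAX_BATCH_SIZE) {
                return;
            }
            insertAll(toCreateEvents);
            toCreateEvents.clear();
        }
    }

    private void flush() {
        synchronized (lock) {
            if (toCreateEvents.isEmpty()) {
                return;
            }
            insertAll(toCreateEvents);
            toCreateEvents.clear();
        }
    }

    private void insertAll(List<AnalyticsEvent> events) {
        // issue one multi-row insert for the whole batch, for example:
        // INSERT INTO analytics_event (timestamp, ip, device_id, user_id, url, browser, os, device, type, data)
        // VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?), (?, ?, ?, ?, ?, ?, ?, ?, ?, ?), ...
    }

    // stands in for the real event type
    public record AnalyticsEvent(String ip, String type, String data) {}
}

The insert happens while the lock is held, exactly as in the steps above; copying the batch and inserting outside the lock is a possible refinement that keeps slow db writes from blocking incoming events.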
It does have a tradeoff: there is a possibility of losing some events. We have no guarantee that the database does not crash between receiving an event and the later attempt to insert it in a batch. Fortunately, inserts are triggered by two conditions: the toCreateEvents list reaching size N, and a scheduler that retries insertion every M milliseconds. If we stop the backend while the database is not available, we might lose some data - but these are analytical events, and this low-probability loss is a perfectly acceptable tradeoff for the huge performance gains it brings. Depending on the batch size (N), this approach can reduce the number of inserts by 100- to 1000-fold.
On the frontend, we take most of the device data from the userAgent property. It is possible to extract the operating system and browser from it (versions as well). A few examples:
// iOS, Safari
Mozilla/5.0 (iPhone; CPU iPhone OS 17_6_1 like Mac OS X)
AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.5 Mobile/15E148 Safari/604.1
// Linux, Chrome
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)
Chrome/126.0.0.0 Safari/537.36
// Windows, Firefox
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:129.0)
Gecko/20100101 Firefox/129.0
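As noted, this extraction can also happen on the backend. Here is a deliberately naive Java sketch, covering only the variants shown above (a real implementation, or a dedicated parsing library, has to handle many more cases, such as Chrome on iOS reporting itself as CriOS):

public class UserAgentParser {

    // order matters: Edge and Opera advertise "Chrome" in their user agents,
    // and Chrome advertises "Safari", so the more specific tokens go first
    public static String browserOf(String userAgent) {
        if (userAgent == null) return "Unknown";
        if (userAgent.contains("Edg")) return "Edge";
        if (userAgent.contains("OPR") || userAgent.contains("Opera")) return "Opera";
        if (userAgent.contains("Chrome")) return "Chrome";
        if (userAgent.contains("Firefox")) return "Firefox";
        if (userAgent.contains("Safari")) return "Safari";
        return "Unknown";
    }

    // similarly, Android user agents contain "Linux" and iPhone ones contain
    // "Mac OS X", so mobile systems are checked before desktop ones
    public static String osOf(String userAgent) {
        if (userAgent == null) return "Unknown";
        if (userAgent.contains("Android")) return "Android";
        if (userAgent.contains("iPhone") || userAgent.contains("iPad")) return "iOS";
        if (userAgent.contains("Windows")) return "Windows";
        if (userAgent.contains("Mac OS")) return "Mac OS";
        if (userAgent.contains("Linux")) return "Linux";
        return "Unknown";
    }
}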
Additionally, the deviceId is generated as a UUID and stored in the Local Storage; the device enum is computed based on the screen size:
const DESKTOP_DEVICE_WIDTH_THRESHOLD = 1200;
const MOBILE_DEVICE_SIZE_THRESHOLD = 500;
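// MOBILE_DEVICE, TABLET_DEVICE, DESKTOP_DEVICE and UNKNOWN_DEVICE
// are string constants defined elsewhere in the frontend code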
export function getDevice() {
    const portraitMode = window.innerWidth < window.innerHeight;
    const landscapeMode = window.innerWidth >= window.innerHeight;
    if ((portraitMode && window.innerWidth < MOBILE_DEVICE_SIZE_THRESHOLD) ||
        (landscapeMode && window.innerHeight < MOBILE_DEVICE_SIZE_THRESHOLD)) {
        return MOBILE_DEVICE;
    }
    if (window.innerWidth < DESKTOP_DEVICE_WIDTH_THRESHOLD) {
        return TABLET_DEVICE;
    }
    if (window.innerWidth >= DESKTOP_DEVICE_WIDTH_THRESHOLD) {
        return DESKTOP_DEVICE;
    }
    return UNKNOWN_DEVICE;
}
As we now know how it works, let's generate some data and run analytical queries - that is the whole purpose after all!
Example data and analytics
Data
How can we get some example data?
From the simple-web-analytics/db directory, run (for the requirements, check out the README):
bash build_and_run.bash
In a different terminal and from the simple-web-analytics directory, start the backend:
bash build_and_run.bash
In yet another terminal, same directory, start the script to generate some random data:
bash generate_random_analytics_events.bash
It will generate 100 000 random events for the last month - it should take no more than 15 to 30 seconds. Once it is done, we can query the data.
Analytics
Simple events count in various periods:
SELECT COUNT(*) AS events
FROM analytics_event
WHERE timestamp >= (NOW() - INTERVAL '1 hour');
events
--------
132
SELECT COUNT(*) AS events
FROM analytics_event
WHERE timestamp >= (NOW() - INTERVAL '1 day');
events
--------
3175
SELECT COUNT(*) AS events
FROM analytics_event
WHERE timestamp >= (NOW() - INTERVAL '1 month');
events
--------
99996
Unique IP addresses per day:
SELECT
timestamp::DATE AS day,
COUNT(DISTINCT ip) AS ips
FROM analytics_event
WHERE timestamp >= (NOW() - INTERVAL '7 day')
GROUP BY day
ORDER BY day;
day | ips
------------+-----
2024-08-08 | 312
2024-08-09 | 312
2024-08-10 | 312
2024-08-11 | 312
2024-08-12 | 312
2024-08-13 | 312
2024-08-14 | 312
2024-08-15 | 295
Unique deviceIds per day:
SELECT
timestamp::DATE AS day,
COUNT(DISTINCT device_id) AS device_ids
FROM analytics_event
WHERE timestamp >= (NOW() - INTERVAL '7 day')
GROUP BY day
ORDER BY day;
day | device_ids
------------+------------
2024-08-08 | 901
2024-08-09 | 959
2024-08-10 | 963
2024-08-11 | 957
2024-08-12 | 971
2024-08-13 | 951
2024-08-14 | 957
2024-08-15 | 609
Unique userIds per day:
SELECT
timestamp::DATE AS day,
COUNT(DISTINCT user_id) AS user_ids
FROM analytics_event
WHERE timestamp >= (NOW() - INTERVAL '7 day')
GROUP BY day
ORDER BY day;
day | user_ids
------------+----------
2024-08-08 | 448
2024-08-09 | 474
2024-08-10 | 484
2024-08-11 | 481
2024-08-12 | 482
2024-08-13 | 479
2024-08-14 | 481
2024-08-15 | 329
Anonymous vs user sessions:
SELECT
timestamp::DATE AS day,
SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) AS anonymous_sessions,
SUM(CASE WHEN user_id IS NULL THEN 0 ELSE 1 END) AS user_sessions
FROM analytics_event
WHERE timestamp >= (NOW() - INTERVAL '7 day')
GROUP BY day
ORDER BY day;
day | anonymous_sessions | user_sessions
------------+--------------------+---------------
2024-08-08 | 1109 | 1140
2024-08-09 | 1598 | 1590
2024-08-10 | 1642 | 1618
2024-08-11 | 1705 | 1597
2024-08-12 | 1649 | 1639
2024-08-13 | 1678 | 1575
2024-08-14 | 1641 | 1655
2024-08-15 | 445 | 515
Browsers, operating systems and devices:
SELECT browser, COUNT(*) AS sessions
FROM analytics_event
WHERE timestamp >= (NOW() - INTERVAL '7 day')
GROUP BY browser
ORDER BY browser;
browser | sessions
---------+----------
Chrome | 3833
Edge | 3807
Firefox | 3881
Opera | 3837
Safari | 3770
Unknown | 3655
SELECT os, COUNT(*) AS sessions
FROM analytics_event
WHERE timestamp >= (NOW() - INTERVAL '7 day')
GROUP BY os
ORDER BY os;
os | sessions
---------+----------
Android | 3792
iOS | 3736
Linux | 3749
Mac OS | 3847
Unknown | 3826
Windows | 3832
SELECT device, COUNT(*) AS sessions
FROM analytics_event
WHERE timestamp >= (NOW() - INTERVAL '7 day')
GROUP BY device
ORDER BY device;
device | sessions
---------+----------
Desktop | 5690
Mobile | 5654
Tablet | 5670
Unknown | 5768
Event types count:
SELECT type, COUNT(*) AS events
FROM analytics_event
WHERE timestamp >= (NOW() - INTERVAL '7 day')
GROUP BY type
ORDER BY type;
type | events
--------------+--------
account-view | 3804
home-view | 3778
profile-edit | 3715
profile-view | 3830
search-input | 3899
search-view | 3749
Some custom data from events:
SELECT data->'input' AS search_input
FROM analytics_event
WHERE timestamp >= (NOW() - INTERVAL '7 day')
AND type = 'search-input'
LIMIT 10;
search_input
------------------------------
"Inflation facts"
"Gold vs BTC"
"BTC"
"Gold"
"Gold"
"LLMs vs Humans"
"Is AI really that powerful"
"Inflation facts"
"Inflation facts"
"Gold vs BTC"
SELECT data->'userId' AS user_id
FROM analytics_event
WHERE timestamp >= (NOW() - INTERVAL '7 day')
AND type = 'profile-view'
LIMIT 10;
user_id
----------------------------------------
"2e30beb1-f010-4746-af78-ba8a8170b3cb"
"b04c50b1-c2a1-4048-ad7e-6b721014533c"
"cf38ecb9-2e6d-40a8-9694-43f862c389a9"
"a4b460a1-fe80-42d4-b1c1-363463613390"
"2856fd14-80db-417e-8c62-28809c8cf815"
"87079753-fcaa-4391-97de-009728b06c54"
"46f73a4a-860c-480d-97dd-8a3fa874f15a"
"f30dd417-4f33-47f7-8bb9-8020a944a77d"
"b9444e78-e0e6-4a75-a1d9-218d9c0c3c53"
"02654b38-42ad-479f-8227-a7a6753bc2fc"
It is also trivial to use Grafana or any other visualization tool that supports SQL to create nice-looking dashboards and reports from this data and analyze it there. Nevertheless, regularly running a few raw SQL queries might be more than enough.
Performance
As already mentioned, thanks to the batch inserts we can handle quite heavy loads. In the following environment:
- Postgres with 1GB of RAM and 1 CPU
- analytics-backend with 2GB of RAM and 4 CPUs
...
2024-08-15T08:04:20.566Z INFO 1 --- 93000/100000 events were inserted...
2024-08-15T08:04:20.612Z INFO 1 --- 94000/100000 events were inserted...
2024-08-15T08:04:20.700Z INFO 1 --- 95000/100000 events were inserted...
2024-08-15T08:04:20.851Z INFO 1 --- 96000/100000 events were inserted...
2024-08-15T08:04:20.921Z INFO 1 --- 97000/100000 events were inserted...
2024-08-15T08:04:20.994Z INFO 1 --- 98000/100000 events were inserted...
2024-08-15T08:04:21.147Z INFO 1 --- 99000/100000 events were inserted...
2024-08-15T08:04:21.183Z INFO 1 --- 100000/100000 events were inserted...
2024-08-15T08:04:21.183Z INFO 1 --- 100000 events inserted! It took: PT13.277879125S
...
which is:
100 000 / 13.28 = 7530 saved events per second
All of this happens with rather limited resource consumption of both the db (1 CPU, so 100% available) and the backend (4 CPUs, so 400% available):
docker stats --no-stream
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
0db8f828a0c2 analytics-backend 17.09% 630.6MiB / 1.953GiB 31.53% 0B / 0B 0B / 152kB 43
1002768e044c analytics-db 0.73% 109.5MiB / 1000MiB 10.95% 0B / 0B 0B / 468MB 9
docker stats --no-stream
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
0db8f828a0c2 analytics-backend 125.45% 631.5MiB / 1.953GiB 31.57% 0B / 0B 0B / 164kB 43
1002768e044c analytics-db 15.34% 110.1MiB / 1000MiB 11.01% 0B / 0B 0B / 472MB 9
docker stats --no-stream
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
0db8f828a0c2 analytics-backend 172.69% 622.9MiB / 1.953GiB 31.15% 0B / 0B 0B / 164kB 43
1002768e044c analytics-db 23.75% 110.4MiB / 1000MiB 11.04% 0B / 0B 0B / 500MB 9
docker stats --no-stream
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
0db8f828a0c2 analytics-backend 195.60% 625.7MiB / 1.953GiB 31.29% 0B / 0B 0B / 176kB 43
1002768e044c analytics-db 22.40% 111MiB / 1000MiB 11.10% 0B / 0B 0B / 530MB 9
Final thoughts
As we saw, creating a simple yet scalable web analytics solution from scratch is not that complicated. But is it production-ready? Yes, there are just a few things to be mindful of:
- The POST: /analytics/events endpoint must be public - we need to take care of the input validation, mainly limiting its size as we did here, but we should also add rate limiting, which can easily be done with the help of a reverse proxy (Nginx for example). Without it, we are susceptible to a Denial of Service attack: for example, somebody might send us lots of random event requests, exhausting our database's disk space, and we will no longer be able to save any events
- Parsing the userAgent string on the frontend can get quite tricky; as an alternative, we can parse the User-Agent header on the backend, which is sent automatically by the browser
- If we expect a significantly higher load from analytics events compared to the regular traffic on our app's backend, it might be wise to scale horizontally and deploy the analytics backend separately, not as a part of our app's backend, since it can slow it down. Scaling vertically, by just giving our backend more resources to handle the additional analytics traffic, is another valid option
- Have separate analytics and application databases - higher load on the analytics database should not have an impact on our app
- The Java + Postgres implementation presented here is just an example. The approach is generic and can be reimplemented in any backend programming language with any JSON-supporting SQL database
- In the event of a longer, not merely temporary, database outage, we might lose some events. If we insist on a 100% guarantee that all analytics events are saved, more resilient code should be added: it would save events temporarily to a file or another, alternative storage and move them to the database once it is healthy again. But let's keep in mind that these are only analytical events: having a redundant setup like this is most likely not worth its complexity
Keeping these points in mind, we can enjoy the power of JSON in SQL and a batch insert trick to handle thousands of events per second with this straightforward yet highly scalable, private, flexible, and generic approach.
Have a great time collecting and analyzing your events!
Links
- Example implementation source code: https://github.com/BinaryIgor/code-examples/tree/master/simple-web-analytics
- JSON support in various SQL databases:
- PostgreSQL: https://www.postgresql.org/docs/current/datatype-json.html
- MySQL: https://dev.mysql.com/doc/refman/8.4/en/json.html
- MariaDB: https://mariadb.com/resources/blog/using-json-in-mariadb/
- SQLite: https://www.sqlite.org/json1.html
- SQL Server: https://learn.microsoft.com/en-us/sql/relational-databases/json/json-data-sql-server
- Oracle: https://docs.oracle.com/en/database/oracle/oracle-database/23/adjsn/json-in-oracle-database.html
- User Agent: