De-duplicating events
One common task that you might want to do is to de-duplicate rows. For instance imagine you have a collection that stores sales data. A sale might get updated multiple times to reflect things like returns, disputes, refunds, etc.
Typically this means you'll have multiple rows for the same sale. For instance maybe you have a table that looks like this
timestamp | sale_id | price | refunded |
---|---|---|---|
1 | 1 | 9.99 | 0 |
2 | 1 | 9.99 | 1 |
When doing analysis we might want to de-duplicate these rows before doing further analysis. One way to accomplish this is by using the DISTINCT keyword ordered by timestamp. So for instance you might want to run a query as follows
SELECT
DISTINCT ON (sale_id) *
FROM
(
SELECT
JSONExtractString(json, 'sale_id') as sale_id,
JSONExtractFloat(json, 'price') as price,
JSONExtractBool(json, 'refunded') as refunded,
timestamp
FROM
sales
)
ORDER BY
timestamp DESC
You can then put this in a common table expression and do further analysis accordingly
WITH deduplicated_sales AS (
SELECT
DISTINCT ON (sale_id) *
FROM
(
SELECT
JSONExtractString(json, 'sale_id') as sale_id,
JSONExtractFloat(json, 'price') as price,
JSONExtractBool(json, 'refunded') as refunded,
timestamp
FROM
sales
)
ORDER BY
timestamp DESC
)
-- Calculate total revenue
SELECT SUM(price) as revenue FROM deduplicated_sales;
Oh one last thing. You might wonder why we use timestamp DESC
, this simply gets the latest row for a particular ID. If we wanted the earliest row, we would use timestamp ASC