Skip to main content

De-duplicating events

One common task that you might want to do is to de-duplicate rows. For instance imagine you have a collection that stores sales data. A sale might get updated multiple times to reflect things like returns, disputes, refunds, etc.

Typically this means you'll have multiple rows for the same sale. For instance maybe you have a table that looks like this

timestampsale_idpricerefunded
119.990
219.991

When doing analysis we might want to de-duplicate these rows before doing further analysis. One way to accomplish this is by using the DISTINCT keyword ordered by timestamp. So for instance you might want to run a query as follows

SELECT
DISTINCT ON (sale_id) *
FROM
(
SELECT
JSONExtractString(json, 'sale_id') as sale_id,
JSONExtractFloat(json, 'price') as price,
JSONExtractBool(json, 'refunded') as refunded,
timestamp
FROM
sales
)
ORDER BY
timestamp DESC

You can then put this in a common table expression and do further analysis accordingly

WITH deduplicated_sales AS (
SELECT
DISTINCT ON (sale_id) *
FROM
(
SELECT
JSONExtractString(json, 'sale_id') as sale_id,
JSONExtractFloat(json, 'price') as price,
JSONExtractBool(json, 'refunded') as refunded,
timestamp
FROM
sales
)
ORDER BY
timestamp DESC
)
-- Calculate total revenue
SELECT SUM(price) as revenue FROM deduplicated_sales;

Oh one last thing. You might wonder why we use timestamp DESC, this simply gets the latest row for a particular ID. If we wanted the earliest row, we would use timestamp ASC