De-duplicating events

One common task that you might want to do is to de-duplicate rows. For instance imagine you have a collection that stores sales data. A sale might get updated multiple times to reflect things like returns, disputes, refunds, etc.

Typically this means you'll have multiple rows for the same sale. For instance maybe you have a table that looks like this

timestamp	sale_id	price	refunded
1	1	9.99	0
2	1	9.99	1

When doing analysis we might want to de-duplicate these rows before doing further analysis. One way to accomplish this is by using the DISTINCT keyword ordered by timestamp. So for instance you might want to run a query as follows

SELECT
  DISTINCT ON (sale_id) *
FROM
  (
    SELECT
      JSONExtractString(json, 'sale_id') as sale_id,
      JSONExtractFloat(json, 'price') as price,
      JSONExtractBool(json, 'refunded') as refunded,
      timestamp
    FROM
      sales
  )
ORDER BY
  timestamp DESC

You can then put this in a common table expression and do further analysis accordingly

WITH deduplicated_sales AS (
  SELECT
    DISTINCT ON (sale_id) *
  FROM
    (
      SELECT
        JSONExtractString(json, 'sale_id') as sale_id,
        JSONExtractFloat(json, 'price') as price,
        JSONExtractBool(json, 'refunded') as refunded,
        timestamp
      FROM
        sales
    )
  ORDER BY
    timestamp DESC
)
-- Calculate total revenue
SELECT SUM(price) as revenue FROM deduplicated_sales;

Oh one last thing. You might wonder why we use timestamp DESC, this simply gets the latest row for a particular ID. If we wanted the earliest row, we would use timestamp ASC