Re: [openstreetmap/openstreetmap-website] Add UserActivities module for structured user activity history (PR #5761)

Tom Hughes via rails-dev Tue, 18 Mar 2025 12:18:38 -0700

@tomhughes requested changes on this pull request.

I'm not sure why you say union is hard? We have 
https://github.com/brianhempel/active_record_union installed already so you can 
just use the `union` or `union_all` method on a relation to merge it with 
another one?


> +  # Common SQL for activity days
+  private_class_method def self.activity_days_sql(_user_id)
+    <<~SQL.squish
+      SELECT DATE_TRUNC('day', changesets.created_at) AS day FROM changesets 
WHERE user_id = :user_id
+      UNION ALL
+      SELECT DATE_TRUNC('day', diary_entries.created_at) AS day FROM 
diary_entries WHERE user_id = :user_id AND visible = true
+      UNION ALL
+      SELECT DATE_TRUNC('day', changeset_comments.created_at) AS day FROM 
changeset_comments WHERE author_id = :user_id
+      UNION ALL
+      SELECT DATE_TRUNC('day', note_comments.created_at) AS day FROM 
note_comments WHERE author_id = :user_id
+      UNION ALL
+      SELECT DATE_TRUNC('day', diary_comments.created_at) AS day FROM 
diary_comments WHERE user_id = :user_id AND visible = true
+      UNION ALL
+      SELECT DATE_TRUNC('day', gpx_files.timestamp) AS day FROM gpx_files 
WHERE user_id = :user_id
+    SQL
+  end

The problem with doing the union first, and only applying the order and limit 
to the result is that we can't use any indexes and will wind up fetching all 
the user's activity for all time and the sorting the combined result before 
applying the limit as this plan shows:

```
                                                                       QUERY 
PLAN                                                                       
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2981.15..2981.16 rows=5 width=8)
   ->  Sort  (cost=2981.15..2995.09 rows=5575 width=8)
         Sort Key: (date_trunc('day'::text, changesets.created_at)) DESC
         ->  Append  (cost=0.57..2888.55 rows=5575 width=8)
               ->  Index Only Scan using changesets_user_id_created_at_idx on 
changesets  (cost=0.57..1406.24 rows=4373 width=8)
                     Index Cond: (user_id = 3980)
               ->  Index Scan using diary_entry_user_id_created_at_index on 
diary_entries  (cost=0.42..5.02 rows=4 width=8)
                     Index Cond: (user_id = 3980)
                     Filter: visible
               ->  Index Only Scan using 
index_changeset_comments_on_author_id_and_created_at on changeset_comments  
(cost=0.43..19.49 rows=96 width=8)
                     Index Cond: (author_id = 3980)
               ->  Index Only Scan using 
index_note_comments_on_author_id_and_created_at on note_comments  
(cost=0.43..151.67 rows=418 width=8)
                     Index Cond: (author_id = 3980)
               ->  Index Scan using index_diary_comments_on_user_id_and_id on 
diary_comments  (cost=0.41..263.85 rows=268 width=8)
                     Index Cond: (user_id = 3980)
                     Filter: visible
               ->  Index Scan using gpx_files_user_id_idx on gpx_files  
(cost=0.43..206.02 rows=416 width=8)
                     Index Cond: (user_id = 3980)
(18 rows)
```

if we use subqueries to sort and limit each branch before the union like this:

```sql
(SELECT DATE_TRUNC('day', changesets.created_at) AS day FROM changesets WHERE 
user_id = :user_id ORDER BY created_at DESC LIMIT :limit)
UNION ALL
(SELECT DATE_TRUNC('day', diary_entries.created_at) AS day FROM diary_entries 
WHERE user_id = :user_id AND visible = true ORDER BY created_at DESC LIMIT 
:limit)
UNION ALL
(SELECT DATE_TRUNC('day', changeset_comments.created_at) AS day FROM 
changeset_comments WHERE author_id = :user_id ORDER BY created_at DESC LIMIT 
:limit)
UNION ALL
(SELECT DATE_TRUNC('day', note_comments.created_at) AS day FROM note_comments 
WHERE author_id = :user_id ORDER BY created_at DESC LIMIT :limit)
UNION ALL
(SELECT DATE_TRUNC('day', diary_comments.created_at) AS day FROM diary_comments 
WHERE user_id = :user_id AND visible = true ORDER BY created_at DESC LIMIT 
:limit)
UNION ALL
(SELECT DATE_TRUNC('day', gpx_files.timestamp) AS day FROM gpx_files WHERE 
user_id = :user_id ORDER BY id DESC LIMIT :limit)
```

then we get a much better plan where we only read N items from each activity 
class before sorting the result and picking the top N overall items:

```
                                                                                
  QUERY PLAN                                                                    
              
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=25.51..25.53 rows=5 width=8)
   ->  Sort  (cost=25.51..25.59 rows=29 width=8)
         Sort Key: "*SELECT* 1_1".day DESC
         ->  Append  (cost=0.57..25.03 rows=29 width=8)
               ->  Subquery Scan on "*SELECT* 1_1"  (cost=0.57..2.68 rows=5 
width=8)
                     ->  Limit  (cost=0.57..2.18 rows=5 width=16)
                           ->  Index Only Scan Backward using 
changesets_user_id_created_at_idx on changesets  (cost=0.57..1406.24 rows=4373 
width=16)
                                 Index Cond: (user_id = 3980)
               ->  Subquery Scan on "*SELECT* 2"  (cost=0.42..5.42 rows=4 
width=8)
                     ->  Limit  (cost=0.42..5.02 rows=4 width=16)
                           ->  Index Scan Backward using 
diary_entry_user_id_created_at_index on diary_entries  (cost=0.42..5.02 rows=4 
width=16)
                                 Index Cond: (user_id = 3980)
                                 Filter: visible
               ->  Subquery Scan on "*SELECT* 3"  (cost=0.43..1.92 rows=5 
width=8)
                     ->  Limit  (cost=0.43..1.42 rows=5 width=16)
                           ->  Index Only Scan Backward using 
index_changeset_comments_on_author_id_and_created_at on changeset_comments  
(cost=0.43..19.49 rows=96 width=16)
                                 Index Cond: (author_id = 3980)
               ->  Subquery Scan on "*SELECT* 4"  (cost=0.43..2.74 rows=5 
width=8)
                     ->  Limit  (cost=0.43..2.24 rows=5 width=16)
                           ->  Index Only Scan Backward using 
index_note_comments_on_author_id_and_created_at on note_comments  
(cost=0.43..151.67 rows=418 width=16)
                                 Index Cond: (author_id = 3980)
               ->  Subquery Scan on "*SELECT* 5"  (cost=0.41..5.83 rows=5 
width=8)
                     ->  Limit  (cost=0.41..5.33 rows=5 width=16)
                           ->  Index Scan Backward using 
diary_comment_user_id_created_at_index on diary_comments  (cost=0.41..263.85 
rows=268 width=16)
                                 Index Cond: (user_id = 3980)
                                 Filter: visible
               ->  Subquery Scan on "*SELECT* 6"  (cost=0.43..4.99 rows=5 
width=8)
                     ->  Limit  (cost=0.43..4.49 rows=5 width=16)
                           ->  Index Scan Backward using 
index_gpx_files_on_user_id_and_id on gpx_files  (cost=0.43..337.88 rows=416 
width=16)
                                 Index Cond: (user_id = 3980)
(30 rows)
```

The only real problem is that when you do that you can't support offset as you 
don't know what offset to start at in each table, but do we actually need 
offset? We're only interested in the most recent activity? If we do need it 
than the subqueries would each need to use offset+limit as their limit which is 
still better until offset gets too large at least.

Note that CTEs are an alternative to inline subqueries here - the result should 
be the same.

> +      SELECT DATE_TRUNC('day', diary_comments.created_at) AS day FROM 
> diary_comments WHERE user_id = :user_id AND visible = true
+      UNION ALL
+      SELECT DATE_TRUNC('day', gpx_files.timestamp) AS day FROM gpx_files 
WHERE user_id = :user_id
+    SQL
+  end
+
+  # Fetch activities for specific days
+  private_class_method def self.fetch_activities(user_id, days)
+    [
+      fetch_changesets(user_id, days),
+      fetch_diary_entries(user_id, days),
+      fetch_changeset_comments(user_id, days),
+      fetch_note_comments(user_id, days),
+      fetch_diary_comments(user_id, days),
+      fetch_gpx_files(user_id, days)
+    ].flatten

I'm not really sure why joining the relations returned by these functions with 
`union` or `union_all` is so hard? It would mean one database round trip 
instead of six? 

-- 
Reply to this email directly or view it on GitHub:
https://github.com/openstreetmap/openstreetmap-website/pull/5761#pullrequestreview-2695950194
You are receiving this because you are subscribed to this thread.

Message ID: 
<openstreetmap/openstreetmap-website/pull/5761/review/2695950...@github.com>

_______________________________________________
rails-dev mailing list
rails-dev@openstreetmap.org
https://lists.openstreetmap.org/listinfo/rails-dev

Re: [openstreetmap/openstreetmap-website] Add UserActivities module for structured user activity history (PR #5761)

Reply via email to