@tomhughes requested changes on this pull request.
I'm not sure why you say union is hard? We have
https://github.com/brianhempel/active_record_union installed already so you can
just use the `union` or `union_all` method on a relation to merge it with
another one?
> + # Common SQL for activity days
+ private_class_method def self.activity_days_sql(_user_id)
+ <<~SQL.squish
+ SELECT DATE_TRUNC('day', changesets.created_at) AS day FROM changesets
WHERE user_id = :user_id
+ UNION ALL
+ SELECT DATE_TRUNC('day', diary_entries.created_at) AS day FROM
diary_entries WHERE user_id = :user_id AND visible = true
+ UNION ALL
+ SELECT DATE_TRUNC('day', changeset_comments.created_at) AS day FROM
changeset_comments WHERE author_id = :user_id
+ UNION ALL
+ SELECT DATE_TRUNC('day', note_comments.created_at) AS day FROM
note_comments WHERE author_id = :user_id
+ UNION ALL
+ SELECT DATE_TRUNC('day', diary_comments.created_at) AS day FROM
diary_comments WHERE user_id = :user_id AND visible = true
+ UNION ALL
+ SELECT DATE_TRUNC('day', gpx_files.timestamp) AS day FROM gpx_files
WHERE user_id = :user_id
+ SQL
+ end
The problem with doing the union first, and only applying the order and limit
to the result is that we can't use any indexes and will wind up fetching all
the user's activity for all time and the sorting the combined result before
applying the limit as this plan shows:
```
QUERY
PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=2981.15..2981.16 rows=5 width=8)
-> Sort (cost=2981.15..2995.09 rows=5575 width=8)
Sort Key: (date_trunc('day'::text, changesets.created_at)) DESC
-> Append (cost=0.57..2888.55 rows=5575 width=8)
-> Index Only Scan using changesets_user_id_created_at_idx on
changesets (cost=0.57..1406.24 rows=4373 width=8)
Index Cond: (user_id = 3980)
-> Index Scan using diary_entry_user_id_created_at_index on
diary_entries (cost=0.42..5.02 rows=4 width=8)
Index Cond: (user_id = 3980)
Filter: visible
-> Index Only Scan using
index_changeset_comments_on_author_id_and_created_at on changeset_comments
(cost=0.43..19.49 rows=96 width=8)
Index Cond: (author_id = 3980)
-> Index Only Scan using
index_note_comments_on_author_id_and_created_at on note_comments
(cost=0.43..151.67 rows=418 width=8)
Index Cond: (author_id = 3980)
-> Index Scan using index_diary_comments_on_user_id_and_id on
diary_comments (cost=0.41..263.85 rows=268 width=8)
Index Cond: (user_id = 3980)
Filter: visible
-> Index Scan using gpx_files_user_id_idx on gpx_files
(cost=0.43..206.02 rows=416 width=8)
Index Cond: (user_id = 3980)
(18 rows)
```
if we use subqueries to sort and limit each branch before the union like this:
```sql
(SELECT DATE_TRUNC('day', changesets.created_at) AS day FROM changesets WHERE
user_id = :user_id ORDER BY created_at DESC LIMIT :limit)
UNION ALL
(SELECT DATE_TRUNC('day', diary_entries.created_at) AS day FROM diary_entries
WHERE user_id = :user_id AND visible = true ORDER BY created_at DESC LIMIT
:limit)
UNION ALL
(SELECT DATE_TRUNC('day', changeset_comments.created_at) AS day FROM
changeset_comments WHERE author_id = :user_id ORDER BY created_at DESC LIMIT
:limit)
UNION ALL
(SELECT DATE_TRUNC('day', note_comments.created_at) AS day FROM note_comments
WHERE author_id = :user_id ORDER BY created_at DESC LIMIT :limit)
UNION ALL
(SELECT DATE_TRUNC('day', diary_comments.created_at) AS day FROM diary_comments
WHERE user_id = :user_id AND visible = true ORDER BY created_at DESC LIMIT
:limit)
UNION ALL
(SELECT DATE_TRUNC('day', gpx_files.timestamp) AS day FROM gpx_files WHERE
user_id = :user_id ORDER BY id DESC LIMIT :limit)
```
then we get a much better plan where we only read N items from each activity
class before sorting the result and picking the top N overall items:
```
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=25.51..25.53 rows=5 width=8)
-> Sort (cost=25.51..25.59 rows=29 width=8)
Sort Key: "*SELECT* 1_1".day DESC
-> Append (cost=0.57..25.03 rows=29 width=8)
-> Subquery Scan on "*SELECT* 1_1" (cost=0.57..2.68 rows=5
width=8)
-> Limit (cost=0.57..2.18 rows=5 width=16)
-> Index Only Scan Backward using
changesets_user_id_created_at_idx on changesets (cost=0.57..1406.24 rows=4373
width=16)
Index Cond: (user_id = 3980)
-> Subquery Scan on "*SELECT* 2" (cost=0.42..5.42 rows=4
width=8)
-> Limit (cost=0.42..5.02 rows=4 width=16)
-> Index Scan Backward using
diary_entry_user_id_created_at_index on diary_entries (cost=0.42..5.02 rows=4
width=16)
Index Cond: (user_id = 3980)
Filter: visible
-> Subquery Scan on "*SELECT* 3" (cost=0.43..1.92 rows=5
width=8)
-> Limit (cost=0.43..1.42 rows=5 width=16)
-> Index Only Scan Backward using
index_changeset_comments_on_author_id_and_created_at on changeset_comments
(cost=0.43..19.49 rows=96 width=16)
Index Cond: (author_id = 3980)
-> Subquery Scan on "*SELECT* 4" (cost=0.43..2.74 rows=5
width=8)
-> Limit (cost=0.43..2.24 rows=5 width=16)
-> Index Only Scan Backward using
index_note_comments_on_author_id_and_created_at on note_comments
(cost=0.43..151.67 rows=418 width=16)
Index Cond: (author_id = 3980)
-> Subquery Scan on "*SELECT* 5" (cost=0.41..5.83 rows=5
width=8)
-> Limit (cost=0.41..5.33 rows=5 width=16)
-> Index Scan Backward using
diary_comment_user_id_created_at_index on diary_comments (cost=0.41..263.85
rows=268 width=16)
Index Cond: (user_id = 3980)
Filter: visible
-> Subquery Scan on "*SELECT* 6" (cost=0.43..4.99 rows=5
width=8)
-> Limit (cost=0.43..4.49 rows=5 width=16)
-> Index Scan Backward using
index_gpx_files_on_user_id_and_id on gpx_files (cost=0.43..337.88 rows=416
width=16)
Index Cond: (user_id = 3980)
(30 rows)
```
The only real problem is that when you do that you can't support offset as you
don't know what offset to start at in each table, but do we actually need
offset? We're only interested in the most recent activity? If we do need it
than the subqueries would each need to use offset+limit as their limit which is
still better until offset gets too large at least.
Note that CTEs are an alternative to inline subqueries here - the result should
be the same.
> + SELECT DATE_TRUNC('day', diary_comments.created_at) AS day FROM
> diary_comments WHERE user_id = :user_id AND visible = true
+ UNION ALL
+ SELECT DATE_TRUNC('day', gpx_files.timestamp) AS day FROM gpx_files
WHERE user_id = :user_id
+ SQL
+ end
+
+ # Fetch activities for specific days
+ private_class_method def self.fetch_activities(user_id, days)
+ [
+ fetch_changesets(user_id, days),
+ fetch_diary_entries(user_id, days),
+ fetch_changeset_comments(user_id, days),
+ fetch_note_comments(user_id, days),
+ fetch_diary_comments(user_id, days),
+ fetch_gpx_files(user_id, days)
+ ].flatten
I'm not really sure why joining the relations returned by these functions with
`union` or `union_all` is so hard? It would mean one database round trip
instead of six?
--
Reply to this email directly or view it on GitHub:
https://github.com/openstreetmap/openstreetmap-website/pull/5761#pullrequestreview-2695950194
You are receiving this because you are subscribed to this thread.
Message ID:
<openstreetmap/openstreetmap-website/pull/5761/review/2695950...@github.com>
_______________________________________________
rails-dev mailing list
rails-dev@openstreetmap.org
https://lists.openstreetmap.org/listinfo/rails-dev