On Fri, 11 Oct 2024 10:53:10 +0900 Fujii Masao <masao.fu...@oss.nttdata.com> wrote:
> > > On 2024/10/04 2:12, Masahiko Sawada wrote: > > Hi, > > > > On Thu, Oct 3, 2024 at 2:23 AM Fujii Masao <masao.fu...@oss.nttdata.com> > > wrote: > >> > >> Hi, > >> > >> Currently, file_fdw updates several columns in the pg_stat_progress_copy > >> view, > >> like relid and bytes_processed, but it doesn't track tuples_processed or > >> tuples_skipped. Monitoring these would be particularly useful when handling > >> large data sets via file_fdw, as it helps track the progress of scan. > >> > >> The attached patch updates file_fdw to add support for reporting > >> the number of tuples processed and skipped (due to on_error = 'ignore') > >> in the pg_stat_progress_copy view. What are your thoughts? > > > > While the patch works fine and looks good to me, in the first place, > > it seems to me that the fact that file_fdw uses the COPY progress > > itself doesn't work properly. For example, unlike COPY command, > > queries could have multiple scans on one or more flie_fdw foreign > > tables when joining tables. I found the discussion for that[1]: there > > was a proposal of disabling COPY progress for file_fdw but the votes > > are split. I think it would be better to consider if we really want to > > support COPY progress for file_fdw before supporting more progress > > information. > > Yes, you're right. We need to address how to handle multiple commands > that trigger progress reporting when executed concurrently, at first. > > The current progress reporting mechanism assumes only one command > triggering progress is running at a time, as each backend has just > one memory area for progress reporting. If multiple commands run > simultaneously, > the progress data would be incorrect. As you mentioned, this could happen > when querying multiple file_fdw foreign tables, where multiple COPY commands > could execute concurrently. > > However, this issue already exists without the proposed patch. > Since file_fdw already reports progress partially, querying multiple > file_fdw tables can lead to inaccurate or confusing progress reports. > You can even observe this when analyzing a file_fdw table and also > when copying to the table with a trigger that executes progress-reporting > commands. > > So, I don’t think this issue should block the proposed patch. > In fact, progress reporting is already flawed in these scenarios, > regardless of whether the patch is applied. > > On the other hand, in many cases where a single file_fdw table is scanned, > COPY progress reporting works correctly for file_fdw and is useful. > Therefore, I believe it's still worth improving file_fdw’s progress reporting. I think reporting tuples_processed and tuples_skipped columns additionally in file_fdw is reasonable, since it already reports bytes_processed and bytes_total. By the way, in the documentation of fild_fdw, it is not explicitly described that file_fdw uses COPY internally, although I can find several wordings like "as COPY". To prevent users to face unexpected experiences, how about explaining explicitly that file_fdw uses COPY and updates pg_stat_progress_copy? > To prevent misleading reports when multiple commands are run concurrently, > just idea, we could consider displaying NULL columns in the progress report > if this situation is detected, as a separate patch. Or, could we add additional field to pg_stat_progress_copy to show how much commands are running COPY? It is also just idea. Regards, Yugo Nagata -- Yugo Nagata <nag...@sraoss.co.jp>