On Mon, 28 Nov 2005, Qingqing Zhou wrote:

>
> I am considering adding an "ice-broker scan thread" to accelerate PostgreSQL
> sequential scan IO speed. The basic idea of this thread is just like the
> "read-ahead" method, but the difference is this one does not read the data
> into shared buffer pool directly, instead, it reads the data into file
> system cache, which makes the integration easy and this is unique to
> PostgreSQL.
>

MySQL, Oracle and others implement read-ahead threads to simulate async IO
'pre-fetching'. I've been experimenting with two ideas. The first is to
increase the read-ahead when we're doing sequential scans (see the attached
prototype patch, which uses posix_fadvise()). I don't have any hardware at
the moment to test this patch on, but I am waiting on some dbt-3 results
which should indicate whether fadvise is a good idea or a bad one.
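For readers who have not used the call, here is a minimal standalone sketch
of the fadvise idea, outside of PostgreSQL. The function name
scan_with_sequential_hint and the 8192-byte buffer size are assumptions made
for illustration; only the posix_fadvise(fd, 0, 0, advice) call itself
mirrors what the attached patch does in FileAdvise().

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#define SCAN_BLOCK_SIZE 8192	/* assumed block size, for illustration only */

/*
 * Tell the kernel that fd will be read sequentially, then read it to EOF.
 * Returns the total number of bytes read, or -1 on a read error.
 */
static long
scan_with_sequential_hint(int fd)
{
	char		buf[SCAN_BLOCK_SIZE];
	long		total = 0;
	ssize_t		n;

	assert(fd >= 0);

	/*
	 * offset 0 and len 0 mean "the whole file". The call is only a hint,
	 * so a failure is ignored here rather than treated as fatal.
	 */
	(void) posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

	while ((n = read(fd, buf, sizeof(buf))) > 0)
		total += n;

	return (n < 0) ? -1 : total;
}
```

Whether the hint actually enlarges the read-ahead window depends on the
kernel, which is why the benchmark results matter.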

The second idea is using POSIX async IO at key points within the system
to better parallelise CPU and IO work. The areas where I think we could use
async IO are: during sequential scans, to pre-fetch blocks; inside WAL, to
begin flushing WAL buffers to disk before we commit; and, inside the
background writer/checkpoint process, to asynchronously write out pages
and, potentially, asynchronously build new checkpoint segments.
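To make the sequential scan case concrete, here is a rough sketch of
double-buffered pre-fetching with POSIX AIO: while block N is being
processed, the read of block N+1 is already in flight. It is not PostgreSQL
code. The names scan_with_aio_prefetch, process_block and BLCKSZ_DEMO are
made up for the example, and the byte sum only stands in for real CPU work.
On older glibc versions this needs to be linked with -lrt.

```c
#define _XOPEN_SOURCE 600
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ_DEMO 8192		/* hypothetical block size */

/* Hypothetical per-block CPU work: sum the bytes of the block. */
static unsigned long
process_block(const unsigned char *buf, ssize_t len)
{
	unsigned long sum = 0;
	ssize_t		i;

	for (i = 0; i < len; i++)
		sum += buf[i];
	return sum;
}

/*
 * Scan fd block by block. Returns 0 on success and stores the byte sum
 * in *result; returns -1 on any IO error. No request is left in flight
 * when the function returns, so stack buffers are safe.
 */
static int
scan_with_aio_prefetch(int fd, unsigned long *result)
{
	unsigned char bufs[2][BLCKSZ_DEMO];
	struct aiocb cb;
	const struct aiocb *wait_list[1];
	off_t		offset = 0;
	int			cur = 0;
	unsigned long sum = 0;

	assert(fd >= 0 && result != NULL);

	/* Start the read of the first block. */
	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_buf = bufs[cur];
	cb.aio_nbytes = BLCKSZ_DEMO;
	cb.aio_offset = offset;
	if (aio_read(&cb) != 0)
		return -1;

	for (;;)
	{
		ssize_t		n;
		int			err;
		int			done;

		/* Wait for the read that is currently in flight. */
		wait_list[0] = &cb;
		while ((err = aio_error(&cb)) == EINPROGRESS)
			(void) aio_suspend(wait_list, 1, NULL);
		if (err != 0)
			return -1;
		n = aio_return(&cb);
		if (n < 0)
			return -1;
		if (n == 0)
			break;				/* end of file */
		offset += n;

		/* Issue the read of the next block into the other buffer... */
		done = cur;
		cur = 1 - cur;
		memset(&cb, 0, sizeof(cb));
		cb.aio_fildes = fd;
		cb.aio_buf = bufs[cur];
		cb.aio_nbytes = BLCKSZ_DEMO;
		cb.aio_offset = offset;
		if (aio_read(&cb) != 0)
			return -1;

		/* ...and do the CPU work on the completed block meanwhile. */
		sum += process_block(bufs[done], n);
	}

	*result = sum;
	return 0;
}
```

The last iteration issues one read past the end of the file, which
completes with a return value of 0 and ends the loop.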

The motivation for using async IO is twofold: first, the results of this
paper[1] are compelling; second, modern OSs support async IO. I know that
Linux[2], Solaris[3], AIX and Windows all have async IO and I presume that
all their rivals have it as well.

The fundamental premise of the paper mentioned above is that if the
database is busy, IO should be busy. With our current block-at-a-time
processing, this isn't always the case. This is why Qingqing's read-ahead
thread makes sense. My reason for mailing is, however, that the async IO
results are more compelling than those for the read-ahead thread.

I haven't had time to prototype whether we can easily implement async IO
but I am planning to work on it in December. The two main goals will be to
a) integrate and utilise async IO, at least within the executor context,
and b) build a primitive kind of scheduler so that we stop prefetching
when we know that there are a certain number of outstanding IOs for a
given device.
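As a rough illustration of goal (b), the sketch below keeps a per-device
count of outstanding IOs and refuses to start another prefetch once a limit
is reached. Everything in it is hypothetical: the PrefetchThrottle struct,
the throttle_* functions, the fixed MAX_DEVICES table and the idea of
identifying a device by a small integer are assumptions for the example,
not a proposed design.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEVICES 8			/* hypothetical fixed number of devices */

typedef struct PrefetchThrottle
{
	int			max_outstanding;			/* per-device limit */
	int			outstanding[MAX_DEVICES];	/* IOs currently in flight */
} PrefetchThrottle;

static void
throttle_init(PrefetchThrottle *t, int max_outstanding)
{
	int			i;

	assert(t != NULL && max_outstanding > 0);
	t->max_outstanding = max_outstanding;
	for (i = 0; i < MAX_DEVICES; i++)
		t->outstanding[i] = 0;
}

/*
 * Ask permission to start one more prefetch on a device. Returns true
 * and counts the IO if the device is below its limit; returns false if
 * the caller should stop prefetching for now.
 */
static bool
throttle_try_start(PrefetchThrottle *t, int device)
{
	assert(device >= 0 && device < MAX_DEVICES);
	if (t->outstanding[device] >= t->max_outstanding)
		return false;
	t->outstanding[device]++;
	return true;
}

/* Record that one previously started IO on the device has completed. */
static void
throttle_finish(PrefetchThrottle *t, int device)
{
	assert(device >= 0 && device < MAX_DEVICES);
	assert(t->outstanding[device] > 0);
	t->outstanding[device]--;
}
```

A scan would call throttle_try_start() before issuing each async read and
throttle_finish() when the read completes, falling back to ordinary
synchronous reads while the device is saturated.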

Thanks,

Gavin



[1] http://www.vldb2005.org/program/paper/wed/p1116-hall.pdf
[2] http://lse.sourceforge.net/io/aionotes.txt
[3] http://developers.sun.com/solaris/articles/event_completion.html - I'm
fairly sure they have a posix AIO wrapper around these routines, but I
cannot see it documented anywhere :-(
Index: src/backend/access/heap/heapam.c
===================================================================
RCS file: /usr/local/cvsroot/pgsql/src/backend/access/heap/heapam.c,v
retrieving revision 1.200
diff -c -p -r1.200 heapam.c
*** src/backend/access/heap/heapam.c    15 Oct 2005 02:49:08 -0000      1.200
--- src/backend/access/heap/heapam.c    18 Nov 2005 04:10:21 -0000
***************
*** 36,41 ****
--- 36,44 ----
   *
   *-------------------------------------------------------------------------
   */
+ 
+ #include <fcntl.h>
+ 
  #include "postgres.h"
  
  #include "access/heapam.h"
***************
*** 49,54 ****
--- 52,58 ----
  #include "miscadmin.h"
  #include "pgstat.h"
  #include "storage/procarray.h"
+ #include "storage/smgr.h"
  #include "utils/inval.h"
  #include "utils/relcache.h"
  
*************** heap_beginscan(Relation relation, Snapsh
*** 659,665 ****
        pgstat_initstats(&scan->rs_pgstat_info, relation);
  
        initscan(scan, key);
! 
        return scan;
  }
  
--- 663,673 ----
        pgstat_initstats(&scan->rs_pgstat_info, relation);
  
        initscan(scan, key);
!       if(!IsBootstrapProcessingMode())
!       {
!               RelationOpenSmgr(relation);
!               RelationSetSmgrAdvice(relation, POSIX_FADV_SEQUENTIAL);
!       }
        return scan;
  }
  
*************** heap_rescan(HeapScanDesc scan,
*** 693,698 ****
--- 701,710 ----
  void
  heap_endscan(HeapScanDesc scan)
  {
+ /*    if(!IsBootstrapProcessingMode())
+               smgradvise(scan->rs_rd->rd_smgr, POSIX_FADV_NORMAL);
+ */
+ 
        /* Note: no locking manipulations needed */
  
        /*
Index: src/backend/access/index/indexam.c
===================================================================
RCS file: /usr/local/cvsroot/pgsql/src/backend/access/index/indexam.c,v
retrieving revision 1.86
diff -c -p -r1.86 indexam.c
*** src/backend/access/index/indexam.c  15 Oct 2005 02:49:09 -0000      1.86
--- src/backend/access/index/indexam.c  18 Nov 2005 03:13:01 -0000
***************
*** 61,73 ****
   *-------------------------------------------------------------------------
   */
  
  #include "postgres.h"
  
  #include "access/genam.h"
  #include "access/heapam.h"
! #include "pgstat.h"
  #include "utils/relcache.h"
  
  
  /* ----------------------------------------------------------------
   *                                    macros used in index_ routines
--- 61,77 ----
   *-------------------------------------------------------------------------
   */
  
+ #include <fcntl.h>
+ 
  #include "postgres.h"
  
  #include "access/genam.h"
  #include "access/heapam.h"
! #include "miscadmin.h"
! #include "storage/smgr.h"
  #include "utils/relcache.h"
  
+ #include "pgstat.h"
  
  /* ----------------------------------------------------------------
   *                                    macros used in index_ routines
*************** index_beginscan(Relation heapRelation,
*** 247,253 ****
        scan->is_multiscan = false;
        scan->heapRelation = heapRelation;
        scan->xs_snapshot = snapshot;
! 
        return scan;
  }
  
--- 251,263 ----
        scan->is_multiscan = false;
        scan->heapRelation = heapRelation;
        scan->xs_snapshot = snapshot;
!       if(!IsBootstrapProcessingMode())
!       {
! //            RelationOpenSmgr(heapRelation);
! //            smgradvise(heapRelation->rd_smgr, POSIX_FADV_RANDOM);
!               RelationOpenSmgr(indexRelation);
!         RelationSetSmgrAdvice(indexRelation, POSIX_FADV_RANDOM);
!       }
        return scan;
  }
  
*************** index_endscan(IndexScanDesc scan)
*** 365,370 ****
--- 375,386 ----
  {
        FmgrInfo   *procedure;
  
+       if(!IsBootstrapProcessingMode())
+       {
+       //    smgradvise(scan->heapRelation->rd_smgr, POSIX_FADV_NORMAL);
+ //            smgradvise(scan->indexRelation->rd_smgr, POSIX_FADV_NORMAL);
+       }
+ 
        SCAN_CHECKS;
        GET_SCAN_PROCEDURE(amendscan);
  
Index: src/backend/storage/file/fd.c
===================================================================
RCS file: /usr/local/cvsroot/pgsql/src/backend/storage/file/fd.c,v
retrieving revision 1.121
diff -c -p -r1.121 fd.c
*** src/backend/storage/file/fd.c       15 Oct 2005 02:49:25 -0000      1.121
--- src/backend/storage/file/fd.c       16 Nov 2005 03:12:31 -0000
*************** FileTruncate(File file, long offset)
*** 1160,1165 ****
--- 1160,1170 ----
        return returnCode;
  }
  
+ int
+ FileAdvise(File file, int advice)
+ {
+       return posix_fadvise(VfdCache[file].fd, 0, 0, advice);
+ }
  
  /*
   * Routines that want to use stdio (ie, FILE*) should use AllocateFile
Index: src/backend/storage/smgr/md.c
===================================================================
RCS file: /usr/local/cvsroot/pgsql/src/backend/storage/smgr/md.c,v
retrieving revision 1.118
diff -c -p -r1.118 md.c
*** src/backend/storage/smgr/md.c       15 Oct 2005 02:49:26 -0000      1.118
--- src/backend/storage/smgr/md.c       18 Nov 2005 02:58:17 -0000
*************** mdopen(SMgrRelation reln, bool allowNotF
*** 365,370 ****
--- 365,372 ----
                }
        }
  
+       FileAdvise(fd, reln->advice);
+ 
        pfree(path);
  
        reln->md_fd = mdfd = _fdvec_alloc();
*************** mdwrite(SMgrRelation reln, BlockNumber b
*** 493,498 ****
--- 495,507 ----
        return true;
  }
  
+ int
+ mdadvise(SMgrRelation reln, int advice)
+ {
+       MdfdVec *v = _mdfd_getseg(reln, 0, false);
+       return FileAdvise(v->mdfd_vfd, advice);
+ }
+ 
  /*
   *    mdnblocks() -- Get the number of blocks stored in a relation.
   *
*************** _mdfd_openseg(SMgrRelation reln, BlockNu
*** 882,887 ****
--- 891,898 ----
        if (fd < 0)
                return NULL;
  
+       
+ 
        /* allocate an mdfdvec entry for it */
        v = _fdvec_alloc();
  
Index: src/backend/storage/smgr/smgr.c
===================================================================
RCS file: /usr/local/cvsroot/pgsql/src/backend/storage/smgr/smgr.c,v
retrieving revision 1.93
diff -c -p -r1.93 smgr.c
*** src/backend/storage/smgr/smgr.c     15 Oct 2005 02:49:26 -0000      1.93
--- src/backend/storage/smgr/smgr.c     18 Nov 2005 03:01:45 -0000
***************
*** 15,20 ****
--- 15,23 ----
   *
   *-------------------------------------------------------------------------
   */
+ 
+ #include <fcntl.h>
+ 
  #include "postgres.h"
  
  #include "access/xact.h"
*************** typedef struct f_smgr
*** 54,59 ****
--- 57,63 ----
        bool            (*smgr_commit) (void);  /* may be NULL */
        bool            (*smgr_abort) (void);   /* may be NULL */
        bool            (*smgr_sync) (void);    /* may be NULL */
+       int                     (*smgr_advise) (SMgrRelation reln, int advice);
  } f_smgr;
  
  
*************** static const f_smgr smgrsw[] = {
*** 61,67 ****
        /* magnetic disk */
        {mdinit, NULL, mdclose, mdcreate, mdunlink, mdextend,
                mdread, mdwrite, mdnblocks, mdtruncate, mdimmedsync,
!               NULL, NULL, mdsync
        }
  };
  
--- 65,71 ----
        /* magnetic disk */
        {mdinit, NULL, mdclose, mdcreate, mdunlink, mdextend,
                mdread, mdwrite, mdnblocks, mdtruncate, mdimmedsync,
!               NULL, NULL, mdsync, mdadvise
        }
  };
  
*************** smgropen(RelFileNode rnode)
*** 219,224 ****
--- 223,229 ----
                reln->smgr_owner = NULL;
                reln->smgr_which = 0;   /* we only have md.c at present */
                reln->md_fd = NULL;             /* mark it not open */
+               reln->advice = POSIX_FADV_NORMAL;
        }
  
        return reln;
*************** smgrcreate(SMgrRelation reln, bool isTem
*** 390,395 ****
--- 395,406 ----
        pendingDeletes = pending;
  }
  
+ int
+ smgradvise(SMgrRelation reln, int advice)
+ {
+       return (*(smgrsw[reln->smgr_which].smgr_advise)) (reln, advice);
+ }
+ 
  /*
   *    smgrscheduleunlink() -- Schedule unlinking a relation at xact commit.
   *
Index: src/include/storage/fd.h
===================================================================
RCS file: /usr/local/cvsroot/pgsql/src/include/storage/fd.h,v
retrieving revision 1.54
diff -c -p -r1.54 fd.h
*** src/include/storage/fd.h    15 Oct 2005 02:49:46 -0000      1.54
--- src/include/storage/fd.h    16 Nov 2005 01:53:49 -0000
*************** extern int      FileWrite(File file, char *bu
*** 69,74 ****
--- 69,75 ----
  extern int    FileSync(File file);
  extern long FileSeek(File file, long offset, int whence);
  extern int    FileTruncate(File file, long offset);
+ extern int FileAdvise(File file, int advice);
  
  /* Operations that allow use of regular stdio --- USE WITH CAUTION */
  extern FILE *AllocateFile(char *name, char *mode);
Index: src/include/storage/smgr.h
===================================================================
RCS file: /usr/local/cvsroot/pgsql/src/include/storage/smgr.h,v
retrieving revision 1.53
diff -c -p -r1.53 smgr.h
*** src/include/storage/smgr.h  15 Oct 2005 02:49:46 -0000      1.53
--- src/include/storage/smgr.h  17 Nov 2005 23:28:37 -0000
*************** typedef struct SMgrRelationData
*** 52,57 ****
--- 52,58 ----
        int                     smgr_which;             /* storage manager selector */
  
        struct _MdfdVec *md_fd;         /* for md.c; NULL if not open */
+       int     advice;                                 /* kernel advise about the file */
  } SMgrRelationData;
  
  typedef SMgrRelationData *SMgrRelation;
*************** extern void PostPrepare_smgr(void);
*** 83,88 ****
--- 84,90 ----
  extern void smgrcommit(void);
  extern void smgrabort(void);
  extern void smgrsync(void);
+ extern int smgradvise(SMgrRelation reln, int advice);
  
  extern void smgr_redo(XLogRecPtr lsn, XLogRecord *record);
  extern void smgr_desc(char *buf, uint8 xl_info, char *rec);
*************** extern bool mdsync(void);
*** 108,113 ****
--- 110,118 ----
  
  extern void RememberFsyncRequest(RelFileNode rnode, BlockNumber segno);
  
+ extern int mdadvise(SMgrRelation reln, int advice);
+ 
+ 
  /* smgrtype.c */
  extern Datum smgrout(PG_FUNCTION_ARGS);
  extern Datum smgrin(PG_FUNCTION_ARGS);
Index: src/include/utils/rel.h
===================================================================
RCS file: /usr/local/cvsroot/pgsql/src/include/utils/rel.h,v
retrieving revision 1.87
diff -c -p -r1.87 rel.h
*** src/include/utils/rel.h     15 Oct 2005 02:49:46 -0000      1.87
--- src/include/utils/rel.h     18 Nov 2005 04:10:18 -0000
*************** typedef Relation *RelationPtr;
*** 278,283 ****
--- 278,289 ----
                } \
        } while (0)
  
+ #define RelationSetSmgrAdvice(relation, _advice) \
+       do { \
+               if ((relation)->rd_smgr != NULL) \
+                       (relation)->rd_smgr->advice = _advice; \
+       } while(0)
+ 
  /*
   * RELATION_IS_LOCAL
   *            If a rel is either temp or newly created in the current transaction,