Read data from Postgres table pages

2024-03-19 Thread Sushrut Shivaswamy
Hey,

I'm trying to build a Postgres export tool that reads data from table pages
and exports it to an S3 bucket. I'd like to avoid manual commands like
pg_dump; I need access to the raw data.

Can you please point me to the Postgres source header / C files that
encapsulate this functionality?
 - List all pages for a table
 - Read a given page for a table

Any pointers to the relevant source code would be appreciated.
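
To make the question concrete, the level of access I have in mind looks
roughly like the sketch below. This is only my guess at the right entry
points (the buffer manager plus the page macros from storage/bufpage.h),
which is exactly what I'd like confirmed or corrected.

```
#include "postgres.h"

#include "access/relation.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

/*
 * Sketch: walk every main-fork page of a relation from inside a server-side
 * extension and look at each line pointer.  Error handling and tuple
 * decoding are omitted.
 */
static void
scan_relation_pages(Oid relid)
{
	Relation	rel = relation_open(relid, AccessShareLock);
	BlockNumber nblocks = RelationGetNumberOfBlocks(rel);
	BlockNumber blkno;

	for (blkno = 0; blkno < nblocks; blkno++)
	{
		Buffer		buf = ReadBuffer(rel, blkno);
		Page		page;
		OffsetNumber off;

		LockBuffer(buf, BUFFER_LOCK_SHARE);
		page = BufferGetPage(buf);

		for (off = FirstOffsetNumber;
			 off <= PageGetMaxOffsetNumber(page);
			 off++)
		{
			ItemId		lp = PageGetItemId(page, off);

			if (!ItemIdIsNormal(lp))
				continue;
			/* For a heap table, PageGetItem(page, lp) is a HeapTupleHeader. */
		}

		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
		ReleaseBuffer(buf);
	}

	relation_close(rel, AccessShareLock);
}
```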

Thanks,
Sushrut


Re: Read data from Postgres table pages

2024-03-19 Thread Sushrut Shivaswamy
I'd like to read individual rows from the pages as they are updated and
stream them to a server to create a copy of the data.
The data will be rewritten to columnar format for analytics queries.

On Tue, Mar 19, 2024 at 7:58 PM Alexander Korotkov 
wrote:

> Hi
>
> On Tue, Mar 19, 2024 at 4:23 PM Sushrut Shivaswamy
>  wrote:
> > I'm trying to build a postgres export tool that reads data from table
> pages and exports it to an S3 bucket. I'd like to avoid manual commands
> like pg_dump, I need access to the raw data.
> >
> > Can you please point me to the postgres source header / cc files that
> encapsulate this functionality?
> >  - List all pages for a table
> > - Read a given page for a table
> >
> > Any pointers to the relevant source code would be appreciated.
>
> Why do you need to work on the source code level?
> Please, check this about having a binary  copy of the database on the
> filesystem level.
> https://www.postgresql.org/docs/current/backup-file.html
>
> --
> Regards,
> Alexander Korotkov
>


Re: Read data from Postgres table pages

2024-03-19 Thread Sushrut Shivaswamy
The binary I"m trying to create should automatically be able to read data
from a postgres instance without users having to
run commands for backup / pg_dump etc.
Having access to the appropriate source headers would allow me to read the
data.

On Tue, Mar 19, 2024 at 8:03 PM Sushrut Shivaswamy <
sushrut.shivasw...@gmail.com> wrote:

> I'd like to read individual rows from the pages as they are updated and
> stream them to a server to create a copy of the data.
> The data will be rewritten to columnar format for analytics queries.
>
> On Tue, Mar 19, 2024 at 7:58 PM Alexander Korotkov 
> wrote:
>
>> Hi
>>
>> On Tue, Mar 19, 2024 at 4:23 PM Sushrut Shivaswamy
>>  wrote:
>> > I'm trying to build a postgres export tool that reads data from table
>> pages and exports it to an S3 bucket. I'd like to avoid manual commands
>> like pg_dump, I need access to the raw data.
>> >
>> > Can you please point me to the postgres source header / cc files that
>> encapsulate this functionality?
>> >  - List all pages for a table
>> > - Read a given page for a table
>> >
>> > Any pointers to the relevant source code would be appreciated.
>>
>> Why do you need to work on the source code level?
>> Please, check this about having a binary  copy of the database on the
>> filesystem level.
>> https://www.postgresql.org/docs/current/backup-file.html
>>
>> --
>> Regards,
>> Alexander Korotkov
>>
>


Re: Read data from Postgres table pages

2024-03-19 Thread Sushrut Shivaswamy
If we query the DB directly, is it possible to know which new rows have been
added since the last query?
Is there a change feed that can be subscribed to?

I'm assuming the page data structures are encapsulated in specific headers
which can be used to list / read pages.
Why would Postgres need to be stopped to read the data? The read / query path
in Postgres reads these same pages while the instance is running, doesn't it?



Re: Read data from Postgres table pages

2024-03-19 Thread Sushrut Shivaswamy
lol, thanks for the inputs Alexander :)!


Columnar format export in Postgres

2024-06-12 Thread Sushrut Shivaswamy
Hey Postgres team,

I have been working on adding support for columnar format export to
Postgres to speed up analytics queries.
I've created an extension that achieves this functionality here
<https://github.com/sushrut141/pg_analytica>.

I'm looking to improve the performance of this extension to enable drop-in
analytics support for Postgres. Some immediate improvements I have in mind
are:
 - Reduce memory consumption when exporting table data to columnar format
 - Create a native planner / execution hook that can read columnar data
with vectorised operations (a minimal hook-installation sketch follows below)
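
For the second item, what I mean by a hook is along these lines: a
do-nothing sketch assuming the PG13+ planner_hook signature, where the
analytica_* names are placeholders and the real work of recognising
columnar-friendly queries is elided.

```
#include "postgres.h"

#include "fmgr.h"
#include "nodes/params.h"
#include "optimizer/planner.h"

PG_MODULE_MAGIC;

static planner_hook_type prev_planner_hook = NULL;

/*
 * Do-nothing planner hook: a real implementation would detect queries that
 * can be answered from the columnar copy and swap in a custom plan here.
 */
static PlannedStmt *
analytica_planner(Query *parse, const char *query_string,
				  int cursorOptions, ParamListInfo boundParams)
{
	if (prev_planner_hook)
		return prev_planner_hook(parse, query_string, cursorOptions, boundParams);
	return standard_planner(parse, query_string, cursorOptions, boundParams);
}

void
_PG_init(void)
{
	/* Chain onto any previously installed hook. */
	prev_planner_hook = planner_hook;
	planner_hook = analytica_planner;
}
```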

It would be very helpful if you could take a look and suggest improvements
to the extension.
Hopefully, this extension can be shipped by default with postgres at some
point in the future.

Thanks,
Sushrut


Re: Columnar format export in Postgres

2024-06-13 Thread Sushrut Shivaswamy
Thanks for the response.

I had considered using COPY TO to export columnar data but gave up on it
since the formats weren't extensible.
It's great to see that you are making it extensible.

I'm still going through the thread of comments on your patch but I have
some early thoughts about using it for columnar data export.

 - To maintain data freshness there would need to be a way to schedule
periodic exports using `COPY TO` with the parquet format.
   - pg_analytica has the scheduling logic; once available, COPY TO could be
used to export the data instead of the chunked table reads used currently.

 - To facilitate efficient querying it would help to export multiple parquet
files for the table instead of a single file.
   Having multiple files allows queries to skip chunks if the key range in a
chunk does not match the query's filter criteria.
   Even within a chunk it would help to be able to configure the size of a
row group.
   - I'm not sure how these parameters would be exposed within `COPY TO`.
Or maybe the extension implementing the `COPY TO` handler will allow this
configuration?

 - Regarding using file_fdw to read Apache Arrow and Apache Parquet files
because file_fdw is based on COPY FROM:
   - I'm not too clear on this. file_fdw seems to allow creating a table
from data on disk exported using COPY TO.
     But is the newly created table still backed by the data on disk (maybe
in columnar format or CSV), or is it just reading that data to create a
row-based table?
     I'm not aware of any capability in the Postgres planner to read columnar
files currently without using an extension like parquet_fdw.
   - For your use case, how do you plan to query the Arrow / Parquet data?


Re: Columnar format export in Postgres

2024-06-13 Thread Sushrut Shivaswamy
>
> If you want to have any hope, the license must be BSD.
> GPL is incompatible.


Ack, will update the license to BSD. Thanks

On Wed, Jun 12, 2024 at 10:49 PM Ranier Vilela  wrote:

> Em qua., 12 de jun. de 2024 às 13:56, Sushrut Shivaswamy <
> sushrut.shivasw...@gmail.com> escreveu:
>
>> Hey Postgres team,
>>
>> I have been working on adding support for columnar format export to
>> Postgres to speed up analytics queries.
>> I've created an extension that achieves this functionality here
>> <https://github.com/sushrut141/pg_analytica>.
>>
>> I"m looking to improve the performance of this extension to enable
>> drop-in analytics support for Postgres. Some immediate improvements I have
>> in mind are:
>>  - Reduce memory consumption when exporting table data to columnar format
>>  - Create a native planner / execution hook that can read columnar data
>> with vectorised operations.
>>
>> It would be very helpful if you could take a look and suggest
>> improvements to the extension.
>> Hopefully, this extension can be shipped by default with postgres at some
>> point in the future.
>>
> If you want to have any hope, the license must be BSD.
> GPL is incompatible.
>
> best regards,
> Ranier Vilela
>


Background Processes in Postgres Extension

2024-04-23 Thread Sushrut Shivaswamy
Hey,

I'm developing a Postgres extension as a custom table access method
definition.
Within the extension, I'm planning to create two background processes using
`fork()` that will process data in the background.

Are there any recommendations / guidelines around creating background
processes within extensions in postgres?

Thanks,
Sushrut


Read table rows in chunks

2024-04-27 Thread Sushrut Shivaswamy
Hey,

I"m trying to read the rows of a table in chunks to process them in a
background worker.
I want to ensure that each row is processed only once.

I was thinking of using the `SELECT * ... OFFSET {offset_size} LIMIT
{limit_size}` functionality for this but I"m running into issues.

Some approaches I had in mind that aren't working out:
 - Try to use the transaction id to query rows created since the last
processed transaction id
   - It doesn't seem like Postgres exposes row transaction ids in a way I
can use here, so this approach is not feasible
 - Rely on an OFFSET / LIMIT combination to query the next chunk of data
   - SELECT * does not guarantee ordering of rows, so it's possible older
rows repeat or newer rows are missed in a chunk
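
The closest workaround I can come up with is keyset pagination over the
primary key, roughly as sketched below. This assumes a table with a
monotonically increasing bigint key whose values become visible in order,
neither of which may hold in general; the table and column names are made up.

```
#include "postgres.h"

#include "executor/spi.h"

/*
 * Sketch: fetch the next chunk of a hypothetical "events" table strictly
 * after the last id seen, so rows are neither skipped nor repeated as long
 * as ids become visible in increasing order.  Assumes we are already inside
 * a transaction (e.g. via StartTransactionCommand in a background worker).
 */
static int64
process_next_chunk(int64 last_seen_id, int chunk_size)
{
	char		query[256];
	uint64		i;

	snprintf(query, sizeof(query),
			 "SELECT id, payload FROM events "
			 "WHERE id > " INT64_FORMAT " ORDER BY id LIMIT %d",
			 last_seen_id, chunk_size);

	if (SPI_connect() != SPI_OK_CONNECT)
		elog(ERROR, "SPI_connect failed");

	if (SPI_execute(query, true /* read-only */ , 0) != SPI_OK_SELECT)
		elog(ERROR, "chunk query failed");

	for (i = 0; i < SPI_processed; i++)
	{
		bool		isnull;
		Datum		id_datum = SPI_getbinval(SPI_tuptable->vals[i],
											 SPI_tuptable->tupdesc, 1, &isnull);

		if (!isnull)
			last_seen_id = DatumGetInt64(id_datum);
		/* ... process the rest of the row here ... */
	}

	SPI_finish();

	return last_seen_id;	/* persist this watermark between runs */
}
```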

Can you please suggest any alternatives for periodically reading rows from a
table in chunks while processing each row exactly once?

Thanks,
Sushrut


Re: Background Processes in Postgres Extension

2024-04-27 Thread Sushrut Shivaswamy
Thanks for the suggestion to use a Postgres background worker.

I tried creating one following the implementation in worker_spi and am able
to spawn a background worker successfully.

However, the background worker seems to cause the postmaster to crash when I
wait for it to finish using `WaitForBackgroundWorkerShutdown`.
The function called by the background worker is empty except for logging a
message to disk.

Any ideas on what could be going wrong / tips for debugging?
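
For reference, the registration path in my extension looks roughly like the
condensed sketch below, with placeholder names, modeled on worker_spi.

```
#include "postgres.h"

#include "miscadmin.h"
#include "postmaster/bgworker.h"

/* Worker entry point: just logs and exits. */
PGDLLEXPORT void
my_worker_main(Datum main_arg)
{
	/* Signals start out blocked; unblock once handlers (none here) are set. */
	BackgroundWorkerUnblockSignals();
	elog(LOG, "background worker started");
}

static void
start_and_wait_for_worker(void)
{
	BackgroundWorker worker;
	BackgroundWorkerHandle *handle;
	BgwHandleStatus status;
	pid_t		pid;

	MemSet(&worker, 0, sizeof(worker));
	worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
	worker.bgw_restart_time = BGW_NEVER_RESTART;
	snprintf(worker.bgw_library_name, BGW_MAXLEN, "my_extension");	/* placeholder */
	snprintf(worker.bgw_function_name, BGW_MAXLEN, "my_worker_main");
	snprintf(worker.bgw_name, BGW_MAXLEN, "my extension worker");
	/* Needed so this backend is notified and can use WaitForBackgroundWorker*. */
	worker.bgw_notify_pid = MyProcPid;

	if (!RegisterDynamicBackgroundWorker(&worker, &handle))
		elog(ERROR, "could not register background worker");

	status = WaitForBackgroundWorkerStartup(handle, &pid);
	if (status != BGWH_STARTED)
		elog(ERROR, "background worker did not start");

	/* ... later ... */
	WaitForBackgroundWorkerShutdown(handle);
}
```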

Thanks,
Sushrut


Reading timestamp values from Datums gives garbage values

2024-05-20 Thread Sushrut Shivaswamy
Hey,

I'm trying to read a timestamp column as EPOCH.
My query is as follows.
```
SELECT EXTRACT(EPOCH FROM timestamp_column) FROM table;

      column
------------------
 1716213097.86486
```
When run in the console, this query gives valid epoch output, which appears
to be of type double.

When trying to read the query response from the Datum, I get garbage values.
I've tried various types and none of them read the correct value.
```

Datum current_timestamp = SPI_getbinval(SPI_tuptable->vals[i],
                                        SPI_tuptable->tupdesc, 5, &isnull);

double current_time = DatumGetFloat8(current_timestamp); // prints 0

int64 time = DatumGetUInt64(current_timestamp); // prints 5293917674
```

Can you help me out with the correct way to read EPOCH values from datums?

Thanks,
Sushrut


Re: Reading timestamp values from Datums gives garbage values

2024-05-21 Thread Sushrut Shivaswamy
Thank you everyone for your responses.

I was a bit thrown off by how small the timestamp value was the first time I
printed it.
The revelation that Postgres TimestampTz uses an epoch (time zero) of
2000-01-01 helped clarify that the value would indeed be smaller than a
regular UNIX epoch timestamp.

In my case I was trying to convert a diff of two timestamps into epoch
seconds, which explains why the value was only about 1.5 hours. My issue is
now resolved.
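
For anyone else hitting this, converting a raw timestamptz Datum to Unix
epoch seconds works out roughly as below. This is a sketch only, assuming
SPI and a plain timestamptz column; the column number is illustrative and
the constants come from datatype/timestamp.h.

```
#include "postgres.h"

#include "datatype/timestamp.h"
#include "executor/spi.h"
#include "utils/timestamp.h"

/*
 * Sketch: read column 5 as a timestamptz Datum and shift it from Postgres'
 * epoch (2000-01-01, in microseconds) to Unix epoch seconds.
 */
static double
datum_to_unix_epoch(HeapTuple tuple, TupleDesc tupdesc)
{
	bool		isnull;
	Datum		d = SPI_getbinval(tuple, tupdesc, 5, &isnull);
	TimestampTz ts;

	if (isnull)
		return 0;

	ts = DatumGetTimestampTz(d);	/* int64 microseconds since 2000-01-01 */

	return (double) ts / USECS_PER_SEC +
		(double) (POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * SECS_PER_DAY;
}
```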

Thanks again
 - Sushrut


Reading all tuples in Index Access Method

2024-08-28 Thread Sushrut Shivaswamy
Hi,

I'm trying to create an Index Access Method routine.
Building the index requires iterating over all rows and calculating an
aggregate (the median of a column), which is then used during index
construction.

The methods in the IndexAmRoutine seem to deal with insertion / index build
one row at a time.
Is there any workaround you can suggest that would allow me to calculate the
median of a column,
store it someplace, and then use it during inserts to the index?

Thanks,
Sushrut


Re: Reading all tuples in Index Access Method

2024-08-28 Thread Sushrut Shivaswamy
Thanks Matthias, table_index_build_scan sounds like what I'm looking for.
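
For the archives, the rough shape of the ambuild path I now have in mind is
below. It's a sketch assuming the PG13+ callback signature; the median_*
names are placeholders and the actual median bookkeeping is elided.

```
#include "postgres.h"

#include "access/genam.h"
#include "access/tableam.h"
#include "nodes/execnodes.h"

typedef struct MedianBuildState
{
	double		reltuples;		/* tuples seen during the build scan */
	/* ... accumulate first-column values for the median here ... */
} MedianBuildState;

/* Called once per heap tuple by table_index_build_scan(). */
static void
median_build_callback(Relation index, ItemPointer tid,
					  Datum *values, bool *isnull,
					  bool tupleIsAlive, void *state)
{
	MedianBuildState *bs = (MedianBuildState *) state;

	if (!isnull[0])
	{
		/* collect values[0] (the first indexed column) for the median */
	}
	bs->reltuples += 1;
}

static IndexBuildResult *
median_ambuild(Relation heap, Relation index, IndexInfo *indexInfo)
{
	MedianBuildState bs = {0};
	IndexBuildResult *result;
	double		reltuples;

	/* Scan the whole table once, feeding every tuple to the callback. */
	reltuples = table_index_build_scan(heap, index, indexInfo,
									   true,	/* allow_sync */
									   true,	/* progress */
									   median_build_callback,
									   (void *) &bs,
									   NULL);	/* let the AM start the scan */

	/* Compute the median here and persist it, e.g. in a metapage at block 0. */

	result = (IndexBuildResult *) palloc0(sizeof(IndexBuildResult));
	result->heap_tuples = reltuples;
	result->index_tuples = bs.reltuples;

	return result;
}
```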

On Wed, Aug 28, 2024 at 9:29 PM Matthias van de Meent <
boekewurm+postg...@gmail.com> wrote:

> On Wed, 28 Aug 2024 at 16:21, Sushrut Shivaswamy
>  wrote:
> >
> > Hi,
> >
> > I'm trying to create an Index Access Method Roting.
> > Building the index requires iterating over all rows and calculating,
> > which is then used during index construction.
> >
> > The methods in the IndexAmRoutine seem to deal with insertion / index
> build one row at a time.
> > Is there any workaround you can suggest that would allow me to calculate
> the median of a column,
> > store it someplace and then use it during Inserts to the index?
>
> I'm not sure what to say. Index insertions through indam->aminsert
> happen as users insert new values into the table, so I don't see how a
> once-calculated median would remain correct across an index's
> lifespan: every time I insert a new value (or delete a tuple) the
> median will change. Furthermore, indexes will not know about deletions
> and updates until significantly after the deleting or updating
> transaction got committed, so transactionally consistent aggregates
> are likely impossible to keep consistent while staying inside the
> index AM API.
>
> However, if you only need this median (or other aggregate) at index
> build time, you should probably look at various indexes'
> indam->ambuild functions, as that function's purpose is to build a new
> index from an existing table's dataset, usually by scanning the table
> with table_index_build_scan.
>
> As for storing such data more permanently: Practically all included
> indexes currently have a metapage at block 0 of the main data fork,
> which contains metadata and bookkeeping info about the index's
> structure, and you're free to do the same for your index.
>
> I hope that helps?
>
> Kind regards,
>
> Matthias van de Meent
> Neon (https://neon.tech)
>