> Maybe "group by", "order by", "distinct on" and hand-written functions
> and aggregates (like first() or best()) may help.
We use these - we have lexical analysis functions which assign a rating to
each row in a set, reflecting the likelihood that the data is a match, and
then we sort our results.
I
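A minimal sketch of how "distinct on" plus a rating function can be combined to keep
the best match per incoming row (the table candidate_matches and the function
match_rating() are hypothetical stand-ins, not the actual code discussed here):

    -- keep only the highest-rated candidate for each incoming record
    SELECT DISTINCT ON (source_id)
           source_id,
           candidate_id,
           match_rating(source_name, candidate_name) AS score
    FROM   candidate_matches
    ORDER  BY source_id,
           match_rating(source_name, candidate_name) DESC;

DISTINCT ON keeps the first row of each source_id group after the sort, i.e. the
best-rated match.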
> I still think that using a PL in the backend might be more performant
> than having an external client, alone being the SPI interface more
> efficient compared to the network serialization for external applications.
I would actually love for this to work better, as this is technology that I
wou
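A sketch of what pushing one of these lookups into the backend could look like in
PL/pgSQL (all names are invented for illustration; PL/pgSQL runs on top of SPI, so
each call avoids a client/server round trip):

    CREATE OR REPLACE FUNCTION find_best_match(p_name text, p_addr text)
    RETURNS integer AS $$
    DECLARE
        v_id integer;
    BEGIN
        -- return the closest existing record, or NULL if nothing matches
        SELECT contact_id INTO v_id
        FROM   contacts
        WHERE  lower(name) = lower(p_name)
        AND    lower(addr) = lower(p_addr)
        LIMIT  1;
        RETURN v_id;
    END;
    $$ LANGUAGE plpgsql STABLE;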
Hi, Carlo,
Carlo Stonebanks wrote:
>> Trying to achieve a high level of data quality in one large project is
>> not often possible. Focus on the most critical areas of checking and get
>> that working first with acceptable performance, then layer on additional
>> checks while tuning. The complexi
Hi, Carlo,
Carlo Stonebanks wrote:
>> Did you think about putting the whole data into PostgreSQL using COPY in
>> a nearly unprocessed manner, index it properly, and then use SQL and
>> stored functions to transform the data inside the database to the
>> desired result?
>
> This is actually what we are doing.
> 1. fork your import somehow to get all 4 cores running
This is already happening, albeit with only 3 of them. No improvement - it
appears we have taken the same problem and divided it by 3, with the same
projected completion time. This is really curious, to say the least.
> 2. write the code that actually does the
> My experience with that type of load process is that doing this
> row-by-row is a very expensive approach and your results bear that out.
I expected this, and had warned the client before the project started that
this is exactly where SQL underperforms.
> It is often better to write each step
> Did you think about putting the whole data into PostgreSQL using COPY in
> a nearly unprocessed manner, index it properly, and then use SQL and
> stored functions to transform the data inside the database to the
> desired result?
This is actually what we are doing. The slowness is on the row-by-row
I have loaded three of the four cores by running three different versions of
the import program, each importing a different segment of the source table.
The server jumps to 75% usage, with three postgresql processes eating up 25%
each; the actual client itself takes up just a few ticks.
"He
On 9/29/06, Carlo Stonebanks <[EMAIL PROTECTED]> wrote:
For reasons I've explained elsewhere, the import process is not well suited
to breaking up the data into smaller segments. However, I'm interested in
what can be indexed. I am used to the idea that indexing only applies to
expressions that al
In response to "Carlo Stonebanks" <[EMAIL PROTECTED]>:
> >> indexes. I don't know whether autovacuum will also analyze tables
> >> for you automagically, but it would be a good idea to analyze the table
> >
> > It does.
>
> So, I have checked my log and I see an autovacuum running once every minute
Hi, Carlo,
Carlo Stonebanks wrote:
> From what I can see, autovacuum is hitting the db's in question about once
> every five minutes. Does this imply an ANALYZE is being done automatically
> that would meet the requirements we are talking about here? Is there any
> benefit to explicitly perfor
Hi, Carlo,
Carlo Stonebanks wrote:
> We urgently need a major performance improvement.
Did you think about putting the whole data into PostgreSQL using COPY in
a nearly unprocessed manner, index it properly, and then use SQL and
stored functions to transform the data inside the database to the
desired result?
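A sketch of the load-raw-then-transform approach being suggested here (the table
names, columns and file path are all illustrative, not from the thread):

    -- 1. bulk load one provider file as-is into a scratch table
    CREATE TABLE scratch_provider (raw_name text, raw_addr text, raw_phone text);
    COPY scratch_provider FROM '/tmp/provider.csv' WITH CSV;

    -- 2. index what the transformation will search on, then update statistics
    CREATE INDEX scratch_provider_name_idx ON scratch_provider (lower(raw_name));
    ANALYZE scratch_provider;

    -- 3. transform and de-dupe in one set-based statement instead of row by row
    INSERT INTO contacts (name, addr, phone)
    SELECT initcap(trim(raw_name)), trim(raw_addr), trim(raw_phone)
    FROM   scratch_provider s
    WHERE  NOT EXISTS (SELECT 1 FROM contacts c
                       WHERE  lower(c.name) = lower(trim(s.raw_name)));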
On Thu, 2006-09-28 at 12:44 -0400, Carlo Stonebanks wrote:
> At this early stage in the project, we are initializing our portal's
> database with millions of rows of imported data in over 50 different
> flattened tables; each table's structure is unique to the data provider.
> This requires a p
Carlo Stonebanks wrote:
We urgently need a major performance improvement. We are running
PostgreSQL 8.1.4 on a Windows 2003 x64 Server on a dual processor, dual core
3.2Ghz Xeon box with 4gb RAM and a RAID (sorry, I don't know what type) disc
subsystem. Sorry about the long intro, but here
>> indexes. I don't know whether autovacuum will also analyze tables
>> for you automagically, but it would be a good idea to analyze the table
>
> It does.
So, I have checked my log and I see an autovacuum running once every minute
on our various databases being hosted on the server - once every
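For reference, the integrated autovacuum in 8.1 wakes up every autovacuum_naptime
seconds (60 by default, which would explain once-a-minute log entries). The relevant
postgresql.conf knobs, with illustrative values, plus an explicit ANALYZE after a
large load, look roughly like this:

    # postgresql.conf (values shown for illustration; check the 8.1 docs for defaults)
    autovacuum = on
    autovacuum_naptime = 60                 # seconds between checks per database
    autovacuum_analyze_scale_factor = 0.2   # fraction of a table that must change

    -- explicit statistics refresh on a table that was just bulk loaded
    ANALYZE scratch_provider;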
"Carlo Stonebanks" <[EMAIL PROTECTED]> writes:
> But is there any way to create an index expression that will help with:
> where foo like '%bar%'?
If you are concerned about that, what you are probably really looking
for is full-text indexing. See contrib/tsearch2 for our current best
answer to t
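A rough sketch of the tsearch2 route on 8.1 (table and column names are invented;
contrib/tsearch2 must first be installed into the database by running its
tsearch2.sql script):

    -- add a text-search vector column and populate it
    ALTER TABLE docs ADD COLUMN fti tsvector;
    UPDATE docs SET fti = to_tsvector('default', body);

    -- GiST index so @@ searches do not sequentially scan the table
    CREATE INDEX docs_fti_idx ON docs USING gist (fti);

    -- word search instead of where body like '%bar%'
    SELECT * FROM docs WHERE fti @@ to_tsquery('default', 'bar');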
> Don't forget about EXPLAIN ANALYZE, which is related to EXPLAIN but has
> nothing to do with the ANALYZE command.
Ah, hence my confusion. Thanks for the clarification... I never knew about
ANALYZE as a separate command.
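To spell out the distinction with an example (the query itself is just an
illustration): EXPLAIN shows the plan the optimizer chose, EXPLAIN ANALYZE actually
runs the statement and reports real row counts and timings, and a bare ANALYZE only
refreshes the planner's statistics for a table.

    EXPLAIN ANALYZE
    SELECT * FROM contacts WHERE lower(name) = lower('Smith');

    ANALYZE contacts;   -- unrelated command: updates planner statistics only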
> by the way, stats_command_string is a known performance killer that
> iirc was improved in 8.2. just fyi.
This is a handy fact; I will get on this right away.
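For anyone else chasing this, it is a plain postgresql.conf setting (line shown for
illustration; as I understand it a reload should pick it up, but check the 8.1 docs):

    # postgresql.conf
    stats_command_string = off    # known performance hit on 8.1; improved in 8.2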
> bulk load denormalized tables into scratch tables in the postgresql
> database. create indexes appropriate to the normalization process
> imo, the key to high performance big data movements in postgresql is
> mastering sql and pl/pgsql, especially the latter. once you get good
> at it, your net time of copy+plpgsql is going to be less than
> insert+tcl.
If this implies bulk inserts, I'm afraid I have to consider something else.
> Something else to consider... databases love doing bulk operations. It
> might be useful to load prospective data into a temporary table, and
> then do as many operations as you can locally (ie: within the database)
> on that table, hopefully eliminating as many candidate rows as possible
> along
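A sketch of that temporary-table idea (names and the file path are invented; the
point is to eliminate candidates with one set-based statement before any per-row
work begins):

    CREATE TEMP TABLE incoming (name text, addr text, phone text);
    COPY incoming FROM '/tmp/batch.csv' WITH CSV;

    -- throw away rows that already have an exact match in the main table
    DELETE FROM incoming
    WHERE EXISTS (SELECT 1 FROM contacts c
                  WHERE  lower(c.name) = lower(incoming.name)
                  AND    lower(c.addr) = lower(incoming.addr));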
Carlo Stonebanks wrote:
> Lots of great info here, I will see what applies to my situation. However, I
> don't see bulk inserts of the tables working, because all of the tables need
> to be refreshed, as the values used to deduplicate and match change with every
> row added. In order for this to work, I would have to write queries
On Thu, Sep 28, 2006 at 02:04:21PM -0700, Steve Atkins wrote:
> I think you're confusing "explain" and "analyze". "Explain" gives you
> human readable output as to what the planner decided to do with the
> query you give it.
Don't forget about EXPLAIN ANALYZE, which is related to EXPLAIN but has
On Sep 28, 2006, at 12:10 PM, Carlo Stonebanks wrote:
Are you wrapping all this in a transaction?
Yes, the transactions can typically wrap 1 to 10 single-table, single-row
inserts and updates.
You're doing some dynamically generated selects as part of the
"de-duping" process? They're probably the expensive bit.
Lots of great info here, I will see what applies to my situation. However, I
don't see bulk inserts of the tables working, because all of the tables need
to be refreshed, as the values used to deduplicate and match change with every
row added. In order for this to work, I would have to write queries
On 9/28/06, Carlo Stonebanks <[EMAIL PROTECTED]> wrote:
The deduplication process requires so many programmed procedures that it
runs on the client. Most of the de-dupe lookups are not "straight" lookups,
but calculated ones employing fuzzy logic. This is because we cannot dictate
the format of ou
The deduplication process requires so many programmed procedures that it
runs on the client. Most of the de-dupe lookups are not "straight" lookups,
but calculated ones employing fuzzy logic. This is because we cannot dictate
the format of our input data and must deduplicate with what we get.
Th
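Some of that fuzzy matching can also be expressed inside the database with
contrib/fuzzystrmatch, which ships with 8.1; a hypothetical ranking query (table,
column and the literal name are made up):

    -- requires contrib/fuzzystrmatch to be installed in the database
    SELECT candidate_id,
           levenshtein(lower(name), lower('Jon Smyth')) AS edit_distance
    FROM   contacts
    WHERE  soundex(name) = soundex('Jon Smyth')          -- cheap first-pass filter
    ORDER  BY levenshtein(lower(name), lower('Jon Smyth'))
    LIMIT  5;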
On 9/28/06, Carlo Stonebanks <[EMAIL PROTECTED]> wrote:
> are you using the 'copy' interface?
Straightforward inserts - the import data has to be transformed, normalised and
de-duped by the import program. I imagine the copy interface is for more
straightforward data importing. These are - by necessity - single row inserts.
> So your program first transforms the data and then inserts it? And it is
> the transforming process which is running select statements that is slow?
There are cross-referencing and deduplication processes. Does this person
have an office at this exact address? In a similarly named building in
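The "office at this exact address" style of check is at least index-friendly; a
hypothetical supporting index and lookup (all names invented):

    CREATE INDEX offices_addr_idx ON offices (lower(address_line1), postal_code);

    SELECT office_id
    FROM   offices
    WHERE  lower(address_line1) = lower('123 Main St, Suite 400')
    AND    postal_code = 'M5V 1E3'
    AND    person_id = 12345;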
> Are you wrapping all this in a transaction?
Yes, the transactions can typically wrap 1 to 10 single-table, single-row
inserts and updates.
> You're doing some dynamically generated selects as part of the
> "de-duping" process? They're probably the expensive bit. What
> do those queries tend
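One option for those repeated, similarly shaped selects - not something suggested in
the thread, just an illustration with invented names - is to prepare them once per
connection and execute them with new parameters, which saves re-planning each time:

    PREPARE find_office(integer, text) AS
        SELECT office_id
        FROM   offices
        WHERE  person_id = $1
        AND    lower(address_line1) = lower($2);

    EXECUTE find_office(12345, '123 Main St, Suite 400');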
Carlo Stonebanks wrote:
>> are you using the 'copy' interface?
>
> Straightforward inserts - the import data has to be transformed, normalised and
> de-duped by the import program. I imagine the copy interface is for more
> straightforward data importing. These are - by necessity - single row
> inserts.
Hi All,
I have a Java application using Hibernate that connects to PostgreSQL 8.1.4.
So, looking through the log files I got the following error:
2006-09-28 09:24:25 LOG: unexpected EOF on client connection
2006-09-28 09:26:06 LOG: unexpected EOF on client connection
2006-09-28 09:48:24 LOG:
On Thu, Sep 28, 2006 at 01:53:22PM -0400, Carlo Stonebanks wrote:
> > are you using the 'copy' interface?
>
> Straightforward inserts - the import data has to be transformed, normalised and
> de-duped by the import program. I imagine the copy interface is for more
> straightforward data importing.
On Thu, Sep 28, 2006 at 01:47:44PM -0400, Carlo Stonebanks wrote:
> > How are you loading the tables? Copy? Insert?
>
> Once the data is transformed, it is inserted. I don't have stats, but the
> program's visual feedback does not spend a lot of time on the "inserting
> data" message. Then again,
On Thu, Sep 28, 2006 at 10:11:31AM -0700, Joshua D. Drake wrote:
> > 4) Can anyone recommend any commercial PostgreSQL service providers that
> > may
> > be able to swiftly come in and assist us with our performance issues?
>
> http://www.commandprompt.com/ (disclaimer, I am an employee)
You f
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Carlo Stonebanks
> Subject: [PERFORM] Performace Optimization for Dummies
>
> At this early stage in the project, we are initializing our portal's
> database with millions of r
On Sep 28, 2006, at 10:53 AM, Carlo Stonebanks wrote:
are you using the 'copy' interface?
Straightforward inserts - the import data has to be transformed, normalised
and de-duped by the import program. I imagine the copy interface is for more
straightforward data importing. These are - by necessity - single row inserts.
> are you using the 'copy' interface?
Straightforward inserts - the import data has to be transformed, normalised and
de-duped by the import program. I imagine the copy interface is for more
straightforward data importing. These are - by necessity - single row
inserts.
> thats a tough question.
> How are you loading the tables? Copy? Insert?
Once the data is transformed, it is inserted. I don't have stats, but the
program's visual feedback does not spend a lot of time on the "inserting
data" message. Then again, if there is an asynchronous component to an
insert, perhaps I am not seeing it.
On 9/28/06, Carlo Stonebanks <[EMAIL PROTECTED]> wrote:
We urgently need a major performance improvement. We are running
PostgreSQL 8.1.4 on a Windows 2003 x64 Server on a dual processor, dual core
3.2Ghz Xeon box with 4gb RAM and a RAID (sorry, I don't know what type) disc
subsystem. Sorry a
> The import is slow - and degrades as the tables grow. With even more
> millions of rows in dozens of import tables to come, the imports will take
> forever. My ability to analyse the queries is limited; because of the nature
> of the import process, the SQL queries are mutable, every imported
I am a software developer who is acting in a (temporary) dba role for a
project. I had recommended PostgreSQL be brought in to replace the proposed
MySQL DB - I chose PostgreSQL because of its reputation as a more stable
solution than MySQL.
At this early stage in the project, we are initializing