[PERFORM] schema design question

2007-08-19 Thread mark overmeer
 Hi,

Maybe not completely the wright place to ask but... I have this schema
design question (db is postgres of course). I have a couple of classes with
attributes. The only goal is to search the object that I want to find (which
is stored on the harddrive).

I have hundreds of classes that are similar but not the same. They all have
attributes/properties (type is probably String), e.g. (in pseudo code):

class A_version_1 {
   attribute1, attribute2, attribute3, ..., attributeN
}

class A_version_2 {
   attribute1, attribute3, ..., attributeN, attributeN+1, attributeN+2
}

class B_version_1 {
   attribute3, attribute4, attribute7, attributeN+3, ..., attributeN+M
}


Class A will have attributes from class B, class B will have attributes from
class C and so on. My initial thought was to use the (sometimes dreaded) EAV
model: class_id, object_id, attribute_id and attribute_value. In this way I
can make queries like:

SELECT CLASS_ID,
   OBJECT_ID
FROM   EAV_TABLE EAV
WHERE  EAV.ATTRIBUTE_ID = X
   AND EAV.ATTRIBUTE_VALUE = 'searchstring'
   AND EXISTS (SELECT OBJECT_ID
   FROM   EAV_TABLE EAV2
   WHERE  EAV.OBJECT_ID = EAV2.OBJECT_ID
  AND EAV.CLASS_ID = EAV2.CLASS_ID
  AND EAV2.ATTRIBUTE_ID = Y
  AND EAV2.ATTRIBUTE_VALUE = 'searchstring2')

Results from this query could be entities from multiple classes!

The alternative is, as many people say: make a proper table for each class
which would lead to hundreds of unions. Is that good/performant? I thought
it would not...  To put all attributes of all classes (as columns) in one
table is impossible. The number of total attributes  should be in the
thousands.

A third alternative I came up with is the entity/value schema design where
each attribute would have its own table. A query would look like this:

SELECT CLASS_ID,
   OBJECT_ID
FROM   EV_X EAV
WHERE  EAV.ATTRIBUTE_VALUE = 'searchstring'
   AND EXISTS (SELECT OBJECT_ID
   FROM   EV_Y EAV2
   WHERE  EAV.OBJECT_ID = EAV2.OBJECT_ID
  AND EAV.CLASS_ID = EAV2.CLASS_ID
  AND EAV2.ATTRIBUTE_VALUE = 'searchstring2')

Which would be a nice way to partition the otherwise large table (but there
would be thousands of smaller tables).

The app I'm writing has to scale to about 1 billion attributes/value-pairs
in total. A normal search query would imply about 5 search terms (but there
could be 20). Any suggestions/remarks (I think the EXISTS should be replaced
by an IN, something else)? Did anyone implement such a search method (or did
they decide to make a different design)? Did it work/scale?

Thanks in advance,

Mark O.


Re: [PERFORM] schema design question

2007-08-19 Thread Adam Tauno Williams

> Maybe not completely the wright place to ask but... I have this schema
> design question (db is postgres of course). I have a couple of classes
> with attributes. The only goal is to search the object that I want to
> find (which is stored on the harddrive). 
> I have hundreds of classes that are similar but not the same. They all
> have attributes/properties (type is probably String), e.g. (in pseudo
> code):

Use table inheritance.

-- 
Adam Tauno Williams, Network & Systems Administrator
Consultant - http://www.whitemiceconsulting.com
Developer - http://www.opengroupware.org


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] schema design question

2007-08-19 Thread mark overmeer
Hi Adam,

Thanks for the fast reply. What should inherit from what? Class A (e.g.
'todo item') is certainly not derived from property X (e.g. 'startdate').
Class A version 2 has different properties (some are removed, others are
added). Can you elaborate / say I'm wrong / give an example ? Thanks,

Mark


2007/8/19, Adam Tauno Williams <[EMAIL PROTECTED]>:
>
>
> > Maybe not completely the wright place to ask but... I have this schema
> > design question (db is postgres of course). I have a couple of classes
> > with attributes. The only goal is to search the object that I want to
> > find (which is stored on the harddrive).
> > I have hundreds of classes that are similar but not the same. They all
> > have attributes/properties (type is probably String), e.g. (in pseudo
> > code):
>
> Use table inheritance.
>
> --
> Adam Tauno Williams, Network & Systems Administrator
> Consultant - http://www.whitemiceconsulting.com
> Developer - http://www.opengroupware.org
>
>
> ---(end of broadcast)---
> TIP 5: don't forget to increase your free space map settings
>


Re: [PERFORM] schema design question

2007-08-19 Thread David Fetter
On Sun, Aug 19, 2007 at 03:19:52PM +0200, mark overmeer wrote:
>  Hi,
> 
> Maybe not completely the wright place to ask but... I have this
> schema design question (db is postgres of course). I have a couple
> of classes with attributes.

Danger, Will Robinson!  Danger!

The DBMS way of looking at things is fundamentally different from OO
coding, and if you try to make them fit together naïvely as you do
below, you only get grief.

> The only goal is to search the object
> that I want to find (which is stored on the harddrive).
> 
> I have hundreds of classes that are similar but not the same. They all have
> attributes/properties (type is probably String), e.g. (in pseudo code):
> 
> class A_version_1 {
>attribute1, attribute2, attribute3, ..., attributeN
> }
> 
> class A_version_2 {
>attribute1, attribute3, ..., attributeN, attributeN+1, attributeN+2
> }
> 
> class B_version_1 {
>attribute3, attribute4, attribute7, attributeN+3, ..., attributeN+M
> }
> 
> 
> Class A will have attributes from class B, class B will have
> attributes from class C and so on. My initial thought was to use the
> (sometimes dreaded) EAV model: class_id, object_id, attribute_id and
> attribute_value. In this way I can make queries like:
> 
> SELECT CLASS_ID,
>OBJECT_ID
> FROM   EAV_TABLE EAV

There's your mistake.  EAV is not performant, and won't become so.

Decide what your database will and won't do, and design your schema
around that.  I know it takes a little extra helping of courage, but
it's worth it in the long run.

Cheers,
David.
-- 
David Fetter <[EMAIL PROTECTED]> http://fetter.org/
phone: +1 415 235 3778AIM: dfetter666
  Skype: davidfetter

Remember to vote!
Consider donating to PostgreSQL: http://www.postgresql.org/about/donate

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [PERFORM] schema design question

2007-08-19 Thread Steinar H. Gunderson
On Sun, Aug 19, 2007 at 11:12:16AM -0700, David Fetter wrote:
> There's your mistake.  EAV is not performant, and won't become so.

It sort of depends. I put all the EXIF information for my image gallery into
an EAV table -- it was the most logical format at the time, although I'm not
sure I need all the information. Anyhow, with clustering and indexes,
Postgres zips through the five million records easily enough for my use -- at
least fast enough that I can live with it without feeling the need for a
redesign.

As a general database design paradigm, though, I fully agree with you.
Databases are databases, not glorified OO data stores or hash tables.

/* Steinar */
-- 
Homepage: http://www.sesse.net/

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] schema design question

2007-08-19 Thread David Fetter
On Sun, Aug 19, 2007 at 08:26:58PM +0200, Steinar H. Gunderson wrote:
> On Sun, Aug 19, 2007 at 11:12:16AM -0700, David Fetter wrote:
> > There's your mistake.  EAV is not performant, and won't become so.
> 
> It sort of depends. I put all the EXIF information for my image
> gallery into an EAV table -- it was the most logical format at the
> time, although I'm not sure I need all the information. Anyhow, with
> clustering and indexes, Postgres zips through the five million
> records easily enough for my use -- at least fast enough that I can
> live with it without feeling the need for a redesign.

Unless your records are huge, that's a tiny database, where tiny is
defined to mean that the whole thing fits in main memory with plenty
of room to spare.  I guarantee that performance will crash right
through the floor as soon as any table no longer fits in main memory.

> As a general database design paradigm, though, I fully agree with
> you.  Databases are databases, not glorified OO data stores or hash
> tables.

Exactly :)

Cheers,
David.
-- 
David Fetter <[EMAIL PROTECTED]> http://fetter.org/
phone: +1 415 235 3778AIM: dfetter666
  Skype: davidfetter

Remember to vote!
Consider donating to PostgreSQL: http://www.postgresql.org/about/donate

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [PERFORM] [pgsql-jobs] Looking for database hosting

2007-08-19 Thread Andrew Hammond
Nik, you may be underestimating just how much performance can be obtained
from a single database server. For example, an IBM p595 server connected to
an array of ds8300 storage devices could reasonably be expected to provide
several orders of magnitude more performance when compared to commodity
hardware. In commodity space (albeit, just barely), a 16 core opteron
running (the admittedly yet-to-be-released) FreeBSD 7, and a suitably
provisioned SAN should also enormously outperform a beige-box solution, and
at a fraction of the cost. If it's performance you care about then the
pgsql-performance list (which I have cc'd) is the place to talk about it.

I realize this doesn't address your desire to get out of database server
administration. I am not aware of any company which provides database
hosting, further I'm not entirely convinced that's a viable business
solution. The technical issues (security, latency and reliability are the
ones that immediately come to mind) associated with a hosted database server
solution suggest to me that this would not be economically viable. The
business issues around out-sourcing a critical, if not central component of
your architecture seem, at least to me, to be insurmountable.

Andrew


On 8/19/07, Niklas Saers <[EMAIL PROTECTED]> wrote:
>
> Hi,
> the company I'm doing work for is expecting a 20 times increase in
> data and seeks a 10 times increase in performance. Having pushed our
> database server to the limit daily for the past few months we have
> decided we'd prefer to be database users rather than database server
> admins. :-)
>
> Are you or can you recommend a database hosting company that is good
> for clients that require more power than what a single database
> server can offer?
>
> Cheers
>
> Nik
>
> ---(end of broadcast)---
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>choose an index scan if your joining column's datatypes do not
>match
>


Re: [PERFORM] schema design question

2007-08-19 Thread mark overmeer
Hi,

2007/8/19, Steinar H. Gunderson <[EMAIL PROTECTED]>:
>
> As a general database design paradigm, though, I fully agree with you.
> Databases are databases, not glorified OO data stores or hash tables.

I don't want to use it as an OO data store, I use the filesystem for that.
The intended use is to search for the right object. Since it has separate
data structures for searching (indexes) I guess that is one of its
functions.

However, it still doesn't answer my question about the EV model (where each
attribute is given its own table).

Mark

/* Steinar */
> --
> Homepage: http://www.sesse.net/
>


Re: [PERFORM] [pgsql-jobs] Looking for database hosting

2007-08-19 Thread Luke Lonergan
Andrew,

I'd say that commodity systems are the fastest with postgres - many have seen 
big slowdowns with high end servers.  'Several orders of magnitude' is not 
possible by just changing the HW, you've got a SW problem to solve first.  We 
have done 100+ times faster than both Postgres and popular (even gridded) 
commercial DBMS using an intrinsically parallel SW approach.

If the objective is OLAP / DSS there's no substitute for a parallel DB that 
does query and load / transform using all the CPUs and IO channels 
simultaneously.  This role is best met from a value standpoint by clustering 
commodity systems.

For OLTP, we need better SMP and DML algorithmic optimizations for concurrency, 
at which point big SMP machines work.  Right now you can buy a 32 CPU commodity 
(opteron) machine from SUN (X4600) for about $60K loaded.

WRT hosting, we've done a bit of it on GPDB systems, but we're not making it a 
focus area.  Instead, we do subscription pricing by the amount of data used and 
recommend / help get systems set up.

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Andrew Hammond [mailto:[EMAIL PROTECTED]
Sent:   Sunday, August 19, 2007 03:49 PM Eastern Standard Time
To: Niklas Saers
Cc: [EMAIL PROTECTED]; pgsql-performance@postgresql.org
Subject:Re: [PERFORM] [pgsql-jobs] Looking for database hosting

Nik, you may be underestimating just how much performance can be obtained
from a single database server. For example, an IBM p595 server connected to
an array of ds8300 storage devices could reasonably be expected to provide
several orders of magnitude more performance when compared to commodity
hardware. In commodity space (albeit, just barely), a 16 core opteron
running (the admittedly yet-to-be-released) FreeBSD 7, and a suitably
provisioned SAN should also enormously outperform a beige-box solution, and
at a fraction of the cost. If it's performance you care about then the
pgsql-performance list (which I have cc'd) is the place to talk about it.

I realize this doesn't address your desire to get out of database server
administration. I am not aware of any company which provides database
hosting, further I'm not entirely convinced that's a viable business
solution. The technical issues (security, latency and reliability are the
ones that immediately come to mind) associated with a hosted database server
solution suggest to me that this would not be economically viable. The
business issues around out-sourcing a critical, if not central component of
your architecture seem, at least to me, to be insurmountable.

Andrew


On 8/19/07, Niklas Saers <[EMAIL PROTECTED]> wrote:
>
> Hi,
> the company I'm doing work for is expecting a 20 times increase in
> data and seeks a 10 times increase in performance. Having pushed our
> database server to the limit daily for the past few months we have
> decided we'd prefer to be database users rather than database server
> admins. :-)
>
> Are you or can you recommend a database hosting company that is good
> for clients that require more power than what a single database
> server can offer?
>
> Cheers
>
> Nik
>
> ---(end of broadcast)---
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>choose an index scan if your joining column's datatypes do not
>match
>


Re: [PERFORM] schema design question

2007-08-19 Thread David Fetter
On Sun, Aug 19, 2007 at 10:13:08PM +0200, mark overmeer wrote:
> Hi,
> 
> 2007/8/19, Steinar H. Gunderson <[EMAIL PROTECTED]>:
> >
> > As a general database design paradigm, though, I fully agree with
> > you.  Databases are databases, not glorified OO data stores or
> > hash tables.
> 
> I don't want to use it as an OO data store, I use the filesystem for
> that.  The intended use is to search for the right object. Since it
> has separate data structures for searching (indexes) I guess that is
> one of its functions.
> 
> However, it still doesn't answer my question about the EV model
> (where each attribute is given its own table).

The answer to EAV modeling, is, "DON'T!"

Cheers,
David (who, if he were greedy, would be encouraging EAV modeling
because it would cause guaranteed large consulting income later)
-- 
David Fetter <[EMAIL PROTECTED]> http://fetter.org/
phone: +1 415 235 3778AIM: dfetter666
  Skype: davidfetter

Remember to vote!
Consider donating to PostgreSQL: http://www.postgresql.org/about/donate

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] schema design question

2007-08-19 Thread Steinar H. Gunderson
On Sun, Aug 19, 2007 at 11:41:15AM -0700, David Fetter wrote:
> Unless your records are huge, that's a tiny database, where tiny is
> defined to mean that the whole thing fits in main memory with plenty
> of room to spare.  I guarantee that performance will crash right
> through the floor as soon as any table no longer fits in main memory.

Sure, it fits into memory; however, it isn't used so often, though, so it's
frequently not in the cache when it's needed. You are completely right in
that it's much slower from disk than from RAM :-)

The question is, of course, how to best store something like the EXIF
information _without_ using EAV. I could separate out the few fields I
normally use into a horizontal (ie. standard relational) table, but it seems
sort of... lossy? Another possible approach is to keep the EAV table around
for completeness in addition to the few fields I need, but then you do of
course get into normalization issues.

/* Steinar */
-- 
Homepage: http://www.sesse.net/

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [PERFORM] Looking for database hosting: FIX CC LIST!!

2007-08-19 Thread Josh Berkus
Folks,

Please remove pgsql-jobs from your CC list with this thread. That list is 
ONLY for employment ads.  Thank you.

> Nik, you may be underestimating just how much performance can be
> obtained from a single database server ...

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [PERFORM] [pgsql-jobs] Looking for database hosting

2007-08-19 Thread Leon Mergen
Hello,

Just to note something interesting on database scalability: i'm not sure
whether your database is used for processing or just data lookup, but if
it's used for data lookup, look into memcached -- it's a really scalable
caching system which can reduce your database load a lot.

I know a lot of large websites (slashdot, livejournal, etc) use this
solution -- they have dozens of gigabytes worth of memcached processes to
reduce the cache hits (I'm told livejournal has around 200 of those servers
running, making sure around 99.99% of the database queries are just cache
hits). This probably has been discussed on this list before, but just in
case: look into it.

Regards,

Leon Mergen


On 8/19/07, Andrew Hammond <[EMAIL PROTECTED]> wrote:
>
> Nik, you may be underestimating just how much performance can be obtained
> from a single database server. For example, an IBM p595 server connected to
> an array of ds8300 storage devices could reasonably be expected to provide
> several orders of magnitude more performance when compared to commodity
> hardware. In commodity space (albeit, just barely), a 16 core opteron
> running (the admittedly yet-to-be-released) FreeBSD 7, and a suitably
> provisioned SAN should also enormously outperform a beige-box solution, and
> at a fraction of the cost. If it's performance you care about then the
> pgsql-performance list (which I have cc'd) is the place to talk about it.
>
> I realize this doesn't address your desire to get out of database server
> administration. I am not aware of any company which provides database
> hosting, further I'm not entirely convinced that's a viable business
> solution. The technical issues (security, latency and reliability are the
> ones that immediately come to mind) associated with a hosted database server
> solution suggest to me that this would not be economically viable. The
> business issues around out-sourcing a critical, if not central component of
> your architecture seem, at least to me, to be insurmountable.
>
> Andrew
>
>
> On 8/19/07, Niklas Saers <[EMAIL PROTECTED]> wrote:
> >
> > Hi,
> > the company I'm doing work for is expecting a 20 times increase in
> > data and seeks a 10 times increase in performance. Having pushed our
> > database server to the limit daily for the past few months we have
> > decided we'd prefer to be database users rather than database server
> > admins. :-)
> >
> > Are you or can you recommend a database hosting company that is good
> > for clients that require more power than what a single database
> > server can offer?
> >
> > Cheers
> >
> > Nik
> >
> > ---(end of broadcast)---
> > TIP 9: In versions below 8.0, the planner will ignore your desire to
> >choose an index scan if your joining column's datatypes do not
> >match
> >
>
>


-- 
Leon Mergen
http://www.solatis.com


Re: [PERFORM] schema design question

2007-08-19 Thread Adam Tauno Williams


> However, it still doesn't answer my question about the EV model (where
> each attribute is given its own table).

Do a TABLE(object_id INT, attribute STRING, value STRING) if you just
want to be able to search for objects by an attribute.  But better yet
look at one of the thousand object persistence systems out there,  not
much to be gained from re-inventing the wheel. 

-- 
Adam Tauno Williams, Network & Systems Administrator
Consultant - http://www.whitemiceconsulting.com
Developer - http://www.opengroupware.org


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] [pgsql-jobs] Looking for database hosting

2007-08-19 Thread Andrew Hammond
On 8/19/07, Luke Lonergan <[EMAIL PROTECTED]> wrote:
>
>  Andrew,
>
> I'd say that commodity systems are the fastest with postgres - many have
> seen big slowdowns with high end servers.  'Several orders of magnitude' is
> not possible by just changing the HW,
>

Going from one or two SATA disks to a SAN farm ought to achieve orders of
magnitude in improvement. And cost. Going from 2GB of memory up to 16 or
32GB can make significant changes as well. However I agree with you that
intelligence at the application layer such that you can take advantage of a
parallel approach is a superior solution both in terms of overall
effectiveness and cost effectiveness.

you've got a SW problem to solve first.  We have done 100+ times faster than
> both Postgres and popular (even gridded) commercial DBMS using an
> intrinsically parallel SW approach.
>

That is both cool and unsurprising at the same time. One of the major
challenges I've seen in practice is that small companies don't generally
start off with a db design that's capable of a parallel approach. With
success and growth, there comes a point where a massive re-design is needed.
Companies that recognize this, make the investment and take the risk are
rare.

If the objective is OLAP / DSS there's no substitute for a parallel DB that
> does query and load / transform using all the CPUs and IO channels
> simultaneously.  This role is best met from a value standpoint by clustering
> commodity systems.
>
> For OLTP, we need better SMP and DML algorithmic optimizations for
> concurrency, at which point big SMP machines work.  Right now you can buy a
> 32 CPU commodity (opteron) machine from SUN (X4600) for about $60K loaded.
>


WRT hosting, we've done a bit of it on GPDB systems, but we're not making it
> a focus area.  Instead, we do subscription pricing by the amount of data
> used and recommend / help get systems set up.
>
> - Luke
>
> Msg is shrt cuz m on ma treo
>
>
>  -Original Message-
> From:   Andrew Hammond [mailto:[EMAIL PROTECTED]<[EMAIL PROTECTED]>
> ]
> Sent:   Sunday, August 19, 2007 03:49 PM Eastern Standard Time
> To: Niklas Saers
> Cc: [EMAIL PROTECTED]; pgsql-performance@postgresql.org
> Subject:Re: [PERFORM] [pgsql-jobs] Looking for database hosting
>
> Nik, you may be underestimating just how much performance can be obtained
> from a single database server. For example, an IBM p595 server connected
> to
> an array of ds8300 storage devices could reasonably be expected to provide
> several orders of magnitude more performance when compared to commodity
> hardware. In commodity space (albeit, just barely), a 16 core opteron
> running (the admittedly yet-to-be-released) FreeBSD 7, and a suitably
> provisioned SAN should also enormously outperform a beige-box solution,
> and
> at a fraction of the cost. If it's performance you care about then the
> pgsql-performance list (which I have cc'd) is the place to talk about it.
>
> I realize this doesn't address your desire to get out of database server
> administration. I am not aware of any company which provides database
> hosting, further I'm not entirely convinced that's a viable business
> solution. The technical issues (security, latency and reliability are the
> ones that immediately come to mind) associated with a hosted database
> server
> solution suggest to me that this would not be economically viable. The
> business issues around out-sourcing a critical, if not central component
> of
> your architecture seem, at least to me, to be insurmountable.
>
> Andrew
>
>
> On 8/19/07, Niklas Saers <[EMAIL PROTECTED]> wrote:
> >
> > Hi,
> > the company I'm doing work for is expecting a 20 times increase in
> > data and seeks a 10 times increase in performance. Having pushed our
> > database server to the limit daily for the past few months we have
> > decided we'd prefer to be database users rather than database server
> > admins. :-)
> >
> > Are you or can you recommend a database hosting company that is good
> > for clients that require more power than what a single database
> > server can offer?
> >
> > Cheers
> >
> > Nik
> >
> > ---(end of broadcast)---
> > TIP 9: In versions below 8.0, the planner will ignore your desire to
> >choose an index scan if your joining column's datatypes do not
> >match
> >
>
>