On Thu, Oct 14, 2010 at 6:41 PM, Ryan Rawson <[email protected]> wrote:

> If you have a single row that approaches and then exceeds the size of a
> region, you will eventually end up with that row as a single region,
> with the region encompassing only that one row.
>
> The reason for HBase and Bigtable is the overhead that HDFS
> has... every file in HDFS uses an amount of namenode RAM that is
> independent of the size of the file.  Meaning the more small files
> you have, the more RAM you use, and you run out of namenode
> scalability.  So HBase exists to store smaller values. There is some
> overhead, so once you start putting in larger values, you might as
> well avoid the overhead and go straight to/from HDFS.
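To make the namenode-memory point concrete, here is a rough sketch; the ~150 bytes per filesystem object is a commonly cited rule of thumb, not a number from this thread:

```python
# Rough namenode heap estimate: every file and block is an in-memory
# object of roughly fixed size, independent of the file's contents.
BYTES_PER_OBJECT = 150  # common rule of thumb, not an exact figure

def namenode_ram_bytes(num_files, blocks_per_file=1):
    # one inode object per file, plus one object per block
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

# Ten million small single-block files already cost roughly 3 GB of
# namenode heap, no matter how tiny each file is.
print(namenode_ram_bytes(10_000_000) / 2**30)
```

This is why storing millions of tiny objects as individual HDFS files does not scale, while a handful of large store files does.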


Well, for the scenario that I listed above (millions of small key-value
pairs whose total size exceeds 256MB), storing these key-value pairs
directly in a single HDFS file would not be an option. If we did that, we
would end up scanning through the whole file; whereas if we store them in
HBase, we can leverage the index.
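A quick back-of-envelope sketch of that scenario, assuming ~256-byte keys, 2KB values, and the default 256MB region size (the ~2.5KB-per-cell figure is the same estimate used in the question below):

```python
KB, MB = 1024, 1024 * 1024

key_bytes = 256            # column key
value_bytes = 2 * KB       # cell value
# round up to ~2.5 KB per key-value pair to cover the timestamp and
# other per-cell metadata (an estimate, not an exact on-disk size)
kv_bytes = 2.5 * KB

region_bytes = 256 * MB    # default max region size
max_columns = int(region_bytes // kv_bytes)
print(max_columns)  # 104857 columns before the single row fills a region
```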

>
> -ryan
>
>
> On Thu, Oct 14, 2010 at 5:23 PM, Sean Bigdatafun
> <[email protected]> wrote:
> > Let me ask this question from another angle:
> >
> > The first question is ---
> > if I have millions of columns in a column family in the same row, such
> > that the sum of the key-value pairs exceeds 256MB, what will happen?
> >
> > example:
> > I have a column with a key of 256 bytes and a value of 2K, then let's
> > assume (256 + timestamp size + 2056) ~= 2.5K,
> > so I understand I can store at most 256 * 1024 / 2.5 = 104,857 columns
> > in this column family for this row.
> >
> > Anyone has comments on the math I gave above?
> >
> >
> > The second question is --
> > By the way, if I do not turn on LZO, is my data still compressed (by
> > the system)? -- if so, then the above number will increase a few
> > times, but there still exists a limit on how many columns I can put
> > in a row.
> >
> > The third question is --
> > If I do turn on LZO, does that mean the value gets compressed first,
> > and then the HBase mechanism further compresses the key-value pair?
> >
> > Thanks,
> > Sean
> >
> >
> > On Tue, Sep 7, 2010 at 8:30 PM, Jonathan Gray <[email protected]>
> wrote:
> >
> >> You can go way beyond the max region split / split size.  HBase will
> never
> >> split the region once it is a single row, even if beyond the split size.
> >>
> >> Also, if you're using large values, you should have region sizes much
> >> larger than the default.  It's common to run with 1-2GB regions in many
> >> cases.
> >>
> >> What you may have seen are recommendations that if your cell values are
> >> approaching the default block size on HDFS (64MB), you should consider
> >> putting the data directly into HDFS rather than HBase.
> >>
> >> JG
> >>
> >> > -----Original Message-----
> >> > From: William Kang [mailto:[email protected]]
> >>  > Sent: Tuesday, September 07, 2010 7:36 PM
> >> > To: [email protected]; [email protected]
> >> > Subject: Re: Limits on HBase
> >> >
> >> > Hi,
> >> > Thanks for your reply. How about the row size? I read that a row
> >> > should not be larger than the HDFS file on the region server,
> >> > which is 256MB by default. Is that right? Many thanks.
> >> >
> >> >
> >> > William
> >> >
> >> > On Tue, Sep 7, 2010 at 2:22 PM, Andrew Purtell <[email protected]>
> >> > wrote:
> >> >
> >> > > In addition to what Jon said please be aware that if compression is
> >> > > specified in the table schema, it happens at the store file level --
> >> > > compression happens after write I/O, before read I/O, so if you
> >> > transmit a
> >> > > 100MB object that compresses to 30MB, the performance impact is that
> >> > of
> >> > > 100MB, not 30MB.
> >> > >
> >> > > I also try not to go above 50MB as largest cell size, for the same
> >> > reason.
> >> > > I have tried storing objects larger than 100MB but this can cause
> out
> >> > of
> >> > > memory issues on busy regionservers no matter the size of the heap.
> >> > When/if
> >> > > HBase RPC can send large objects in smaller chunks, this will be
> less
> >> > of an
> >> > > issue.
> >> > >
> >> > > Best regards,
> >> > >
> >> > >    - Andy
> >> > >
> >> > > Why is this email five sentences or less?
> >> > > http://five.sentenc.es/
> >> > >
> >> > >
> >> > > --- On Mon, 9/6/10, Jonathan Gray <[email protected]> wrote:
> >> > >
> >> > > > From: Jonathan Gray <[email protected]>
> >> > > > Subject: RE: Limits on HBase
> >> > > > To: "[email protected]" <[email protected]>
> >> > > > Date: Monday, September 6, 2010, 4:10 PM
> >> > > > I'm not sure what you mean by
> >> > > > "optimized cell size" or whether you're just asking about
> >> > > > practical limits?
> >> > > >
> >> > > > HBase is generally used with cells in the range of tens of
> >> > > > bytes to hundreds of kilobytes.  However, I have used
> >> > > > it with cells that are several megabytes, up to about
> >> > > > 50MB.  Up at that level, I have seen some weird
> >> > > > performance issues.
> >> > > >
> >> > > > The most important thing is to be sure to tweak all of your
> >> > > > settings.  If you have 20MB cells, you need to be sure
> >> > > > to increase the flush size beyond 64MB and the split size
> >> > > > beyond 256MB.  You also need enough memory to support
> >> > > > all this large object allocation.
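For reference, the two settings mentioned above live in hbase-site.xml; the property names below are as of the 0.20.x line, and the values are illustrative, not recommendations:

```xml
<!-- illustrative values only; tune for your workload -->
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>268435456</value> <!-- flush at 256MB instead of the 64MB default -->
</property>
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>1073741824</value> <!-- split at 1GB instead of the 256MB default -->
</property>
```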
> >> > > >
> >> > > > And of course, test test test.  That's the easiest way
> >> > > > to see if what you want to do will work :)
> >> > > >
> >> > > > When you run into problems, e-mail the list.
> >> > > >
> >> > > > As far as row size is concerned, the only issue is that a
> >> > > > row can never span multiple regions so a given row can only
> >> > > > be in one region and thus be hosted on one server at a
> >> > > > time.
> >> > > >
> >> > > > JG
> >> > > >
> >> > > > > -----Original Message-----
> >> > > > > From: William Kang [mailto:[email protected]]
> >> > > > > Sent: Monday, September 06, 2010 1:57 PM
> >> > > > > To: hbase-user
> >> > > > > Subject: Limits on HBase
> >> > > > >
> >> > > > > Hi folks,
> >> > > > > I know this question may have been asked many times,
> >> > > > but I am wondering
> >> > > > > if
> >> > > > > there is any update on the optimized cell size (in
> >> > > > megabytes) and row
> >> > > > > size
> >> > > > > (in megabytes)? Many thanks.
> >> > > > >
> >> > > > >
> >> > > > > William
> >> > > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >>
> >
>
