On Thu, Oct 14, 2010 at 6:41 PM, Ryan Rawson <[email protected]> wrote:
> If you have a single row that approaches then exceeds the size of a
> region, eventually you will end up having that row as a single region,
> with the region encompassing only that one row.
>
> The reason for HBase and bigtable is the overhead that HDFS has...
> every file in HDFS uses an amount of namenode RAM that does not depend
> on the size of the file. Meaning the more small files you have, the
> more RAM you use, and you run out of namenode scalability. So HBase
> exists to store smaller values. There is some overhead. Thus once you
> start putting in larger values, you might as well avoid the overhead
> and go straight to/from HDFS.

Meanwhile, for the scenario that I listed above -- millions of small
key-value pairs that together exceed 256MB -- storing these key-value
pairs directly in a file in HDFS would not be an option. If we did so,
we would end up scanning through the whole file; whereas if we store
them in HBase, we can leverage the index information.

> -ryan
>
>
> On Thu, Oct 14, 2010 at 5:23 PM, Sean Bigdatafun
> <[email protected]> wrote:
> > Let me ask this question from another angle:
> >
> > The first question is ---
> > If I have millions of columns in a column family in the same row,
> > such that the sum of the key-value pairs exceeds 256MB, what will
> > happen?
> >
> > Example:
> > I have a column with a key of 256 bytes and a value of 2K; let's
> > assume (256 + timestamp size + 2056) ~= 2.5K. Then I understand I can
> > store at most 256 * 1024 / 2.5 ~= 104,857 columns in this column
> > family at this row.
> >
> > Does anyone have comments on the math I gave above?
> >
> >
> > The second question is --
> > By the way, if I do not turn on LZO, is my data still compressed (by
> > the system)? If so, the above number will increase a few times, but
> > there is still a limit on how many columns I can put in a row.
> >
> >
> > The third question is --
> > If I do turn on LZO, does that mean the value gets compressed first,
> > and then the HBase mechanism further compresses the key-value pair?
> >
> > Thanks,
> > Sean
> >
> >
> > On Tue, Sep 7, 2010 at 8:30 PM, Jonathan Gray <[email protected]> wrote:
> >
> >> You can go way beyond the max region split / split size. HBase will
> >> never split the region once it is down to a single row, even if it
> >> is beyond the split size.
> >>
> >> Also, if you're using large values, you should have region sizes
> >> much larger than the default. It's common to run with 1-2GB regions
> >> in many cases.
> >>
> >> What you may have seen are recommendations that if your cell values
> >> are approaching the default block size on HDFS (64MB), you should
> >> consider putting the data directly into HDFS rather than HBase.
> >>
> >> JG
> >>
> >> > -----Original Message-----
> >> > From: William Kang [mailto:[email protected]]
> >> > Sent: Tuesday, September 07, 2010 7:36 PM
> >> > To: [email protected]; [email protected]
> >> > Subject: Re: Limits on HBase
> >> >
> >> > Hi,
> >> > Thanks for your reply. How about the row size? I read that a row
> >> > should not be larger than the HDFS file on the region server,
> >> > which is 256M by default. Is that right? Many thanks.
> >> >
> >> >
> >> > William
> >> >
> >> > On Tue, Sep 7, 2010 at 2:22 PM, Andrew Purtell <[email protected]>
> >> > wrote:
> >> >
> >> > > In addition to what Jon said, please be aware that if
> >> > > compression is specified in the table schema, it happens at the
> >> > > store file level -- compression happens after write I/O, before
> >> > > read I/O, so if you transmit a 100MB object that compresses to
> >> > > 30MB, the performance impact is that of 100MB, not 30MB.
> >> > >
> >> > > I also try not to go above 50MB as the largest cell size, for
> >> > > the same reason.
> >> > > I have tried storing objects larger than 100MB, but this can
> >> > > cause out-of-memory issues on busy regionservers no matter the
> >> > > size of the heap. When/if HBase RPC can send large objects in
> >> > > smaller chunks, this will be less of an issue.
> >> > >
> >> > > Best regards,
> >> > >
> >> > >   - Andy
> >> > >
> >> > > Why is this email five sentences or less?
> >> > > http://five.sentenc.es/
> >> > >
> >> > >
> >> > > --- On Mon, 9/6/10, Jonathan Gray <[email protected]> wrote:
> >> > >
> >> > > > From: Jonathan Gray <[email protected]>
> >> > > > Subject: RE: Limits on HBase
> >> > > > To: "[email protected]" <[email protected]>
> >> > > > Date: Monday, September 6, 2010, 4:10 PM
> >> > > > I'm not sure what you mean by "optimized cell size" or whether
> >> > > > you're just asking about practical limits?
> >> > > >
> >> > > > HBase is generally used with cells in the range of tens of
> >> > > > bytes to hundreds of kilobytes. However, I have used it with
> >> > > > cells that are several megabytes, up to about 50MB. Up at that
> >> > > > level, I have seen some weird performance issues.
> >> > > >
> >> > > > The most important thing is to be sure to tweak all of your
> >> > > > settings. If you have 20MB cells, you need to be sure to
> >> > > > increase the flush size beyond 64MB and the split size beyond
> >> > > > 256MB. You also need enough memory to support all this large
> >> > > > object allocation.
> >> > > >
> >> > > > And of course, test test test. That's the easiest way to see
> >> > > > if what you want to do will work :)
> >> > > >
> >> > > > When you run into problems, e-mail the list.
> >> > > >
> >> > > > As far as row size is concerned, the only issue is that a row
> >> > > > can never span multiple regions, so a given row can only be in
> >> > > > one region and thus be hosted on one server at a time.
> >> > > >
> >> > > > JG
> >> > > >
> >> > > > > -----Original Message-----
> >> > > > > From: William Kang [mailto:[email protected]]
> >> > > > > Sent: Monday, September 06, 2010 1:57 PM
> >> > > > > To: hbase-user
> >> > > > > Subject: Limits on HBase
> >> > > > >
> >> > > > > Hi folks,
> >> > > > > I know this question may have been asked many times, but I
> >> > > > > am wondering if there is any update on the optimized cell
> >> > > > > size (in megabytes) and row size (in megabytes)? Many
> >> > > > > thanks.
> >> > > > >
> >> > > > >
> >> > > > > William
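For readers following along, Sean's back-of-the-envelope estimate above can be reproduced in a few lines. The 2.5K per-cell figure is an assumption carried over from the thread (256-byte key + 2K value + timestamp and other overhead), not an exact HBase KeyValue size, and 256MB is the default region split size being discussed:

```python
# Rough estimate from the thread: how many ~2.5K cells fit in a 256MB
# region before a single row alone fills it? The 2.5K/cell figure is
# Sean's assumption, not an exact per-KeyValue size.

REGION_SIZE_KB = 256 * 1024  # default region split size in the thread: 256MB
CELL_SIZE_KB = 2.5           # assumed key + value + overhead per cell

max_columns = int(REGION_SIZE_KB / CELL_SIZE_KB)
print(max_columns)  # ~104,857 columns
```

This matches the order of magnitude in Sean's question; as the replies note, the row would then occupy a region by itself, which HBase will never split further.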

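Jonathan Gray's advice about raising the flush size and split size for large cells corresponds to two properties in hbase-site.xml. The values below are illustrative only, sized along the lines discussed in the thread (flush well above the 64MB default, regions in the 1-2GB range), not recommendations from the thread itself:

```xml
<configuration>
  <!-- Illustrative values for large-cell workloads; tune for your data. -->
  <property>
    <name>hbase.hregion.memstore.flush.size</name>
    <!-- default 64MB; raise it well above your largest cell -->
    <value>268435456</value> <!-- 256MB -->
  </property>
  <property>
    <name>hbase.hregion.max.filesize</name>
    <!-- default split size 256MB; 1-2GB regions are common for large values -->
    <value>1073741824</value> <!-- 1GB -->
  </property>
</configuration>
```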