Re: performance with a "large" number of supercolumns/columns

Jonathan Ellis Wed, 07 Jul 2010 16:08:09 -0700

Hi Terje,

Sorry to not get to this sooner.  Are you still looking into this?


On Tue, Jun 22, 2010 at 12:47 PM, Terje Marthinussen
<tmarthinus...@gmail.com> wrote:
> Hi,
>
> I was looking a bit on a case we have with columnfamily which has 439k
> supercolumns, each supercolumn with ~3 columns each so a total of some 1.3
> million objects in total.
>
> This takes about 9 second to read with thrift on first access, 4-5 second on
> second access.
>
> I took a little closer look at this, I noticed that roughly half of this
> time was spend in cassandra.
>
> I will look more at this, but I thought I would just ask people here as it
> could be that someone already had good explanations...
>
> Most of the time is spent here
>
>    public void collectReducedColumns(IColumnContainer container,
> Iterator<IColumn> reducedColumns, int gcBefore)
>    {
>        int liveColumns = 0;
>        AbstractType comparator = container.getComparator();
>
>        while (reducedColumns.hasNext())
>        {
>            if (liveColumns >= count)
>                break;
>
>            IColumn column = reducedColumns.next();
>            if (logger.isDebugEnabled())
>                logger.debug(String.format("collecting %s of %s: %s",
>                                           liveColumns, count,
> column.getString(comparator)));
>
>            if (finish.length > 0
>                && ((!reversed && comparator.compare(column.name(),
> finish) > 0))
>                    || (reversed && comparator.compare(column.name(),
> finish) < 0))
>                break;
>
>            // only count live columns towards the `count` criteria
>            if (!column.isMarkedForDelete()
>                && (!container.isMarkedForDelete()
>                    || (ClockRelationship.GREATER_THAN ==
> column.mostRecentLiveChangeAt().compare(container.getMarkedForDeleteAt()))))
>            {
>                liveColumns++;
>            }
>
>            // but we need to add all non-gc-able columns to the
> result for read repair:
>            if (QueryFilter.isRelevant(column, container, gcBefore))
>                container.addColumn(column);
>        }
>    }
>
> Adding some time measuring print revealed a few interesting this.
> 1. First time I request the row (I request the entire row in this case),
> collectReducedColumns() is called twice. I have not had time to understand
> why yet, but there is one difference between the calls. All columns are
> returned both times, but the first call is done with MAXINT as "count" while
> the second call has the maxcount actually requested by the client as
> "count". The first call takes about 3.7 seconds to process, the second about
> 1.7 secs. Whatever reason, I think we should be able to remove one of these
> calls?
>
> 2. Almost half the time is spent in "container.addColumn". This is probably
> no big surprise as there is a lot of code hidden "down there". I am however
> very curious if it could not be significantly optimized, especially in the
> case where you have predicates which cases all columns to be included. That
> said, I have not manage to read enough cassandra code to understand
> tombstones or all the other things which is going on in that part of the
> code.
>
> 3. A bit more of a surprise, most of the remaining 50% of the time seems to
> occur at while (reducedColumns.hasNext())
> That is, save system.nanoTime() at the end of the loop and after hasNext and
> sum up and it accounts for almost 50% of the total time. That seems quite
> weird to me.
>
> I will dig more, but I was curious if someone had answers already.
>
> Best regards,
> Terje
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: performance with a "large" number of supercolumns/columns

Reply via email to