Re: [math] Questions about the linear package

Jake Mannix Wed, 14 Oct 2009 10:47:40 -0700

Hi Luc,


On Wed, Oct 14, 2009 at 3:01 AM, <luc.maison...@free.fr> wrote:

> >
> >   * also for RealVector - No iterator methods?  So if the implementation
> > is sparse, there's no way to just iterate over the non-zero entries?
> > What's worse, you can't even subclass OpenMapVector and expose the
> > iterator on the OpenIntToDoubleHashMap inner object, because it's
> > private. :\
>
> Good idea. You can use JIRA <https://issues.apache.org/jira/browse/MATH>
> to register a request for implementing this. Patches are of course welcome.
> There should probably be two iterators: one for all entries and one for the
> non-default entries (which may be non-zeroes or non-NaN or anything else).
>

I'll open up a ticket and attach a patch (with tests, naturally) later
today.


> This API is set up the way I got it from an external contributor, so I
> guess he had a use case for that. I extended it to remain in the same spirit
> and got this huge mess. I'm sorry for that. I agree a more generic method
> would be interesting. Removing these methods would however introduce an
> incompatible API change, so this could be done only in a major release (i.e.
> 3.0) which is probably a long time from now.
>

Yeah, this is why I'm sad I missed the refactoring push to hit 2.0.  For
now, however, a lot of implementation pain could be avoided with
iterator() and iterateNonDefault() methods, together with a single
AbstractRealVector that provides default implementations of all of these
crazy methods, for implementations which don't need to think about them.
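
To make the two-iterator idea concrete, here is a minimal sketch. Every name
in it (VectorEntry, SketchVector, SketchSparseVector) is hypothetical and not
part of commons-math, and a TreeMap stands in for OpenIntToDoubleHashMap. It
assumes the default entry value is zero, which is what lets the norm skip
default entries.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.TreeMap;

// Hypothetical names throughout; this is not the commons-math API.
final class VectorEntry {
    final int index;
    final double value;

    VectorEntry(int index, double value) {
        this.index = index;
        this.value = value;
    }
}

abstract class SketchVector {
    abstract int getDimension();
    abstract double getEntry(int index);

    /** Visits every index in order, default-valued or not. */
    Iterator<VectorEntry> iterator() {
        return new Iterator<VectorEntry>() {
            private int cursor = 0;

            public boolean hasNext() {
                return cursor < getDimension();
            }

            public VectorEntry next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                VectorEntry e = new VectorEntry(cursor, getEntry(cursor));
                cursor++;
                return e;
            }

            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }

    /** Dense fallback: visit everything. Sparse subclasses override this. */
    Iterator<VectorEntry> iterateNonDefault() {
        return iterator();
    }

    /** Written once here; skipping default entries is safe because the default is 0. */
    double getL1Norm() {
        double sum = 0.0;
        for (Iterator<VectorEntry> it = iterateNonDefault(); it.hasNext();) {
            sum += Math.abs(it.next().value);
        }
        return sum;
    }
}

final class SketchSparseVector extends SketchVector {
    private final int dimension;
    // Stand-in for OpenIntToDoubleHashMap; only non-zero entries are stored.
    private final TreeMap<Integer, Double> entries = new TreeMap<Integer, Double>();

    SketchSparseVector(int dimension) {
        this.dimension = dimension;
    }

    int getDimension() {
        return dimension;
    }

    double getEntry(int index) {
        Double v = entries.get(index);
        return v == null ? 0.0 : v;
    }

    void setEntry(int index, double value) {
        if (value == 0.0) {
            entries.remove(index);
        } else {
            entries.put(index, value);
        }
    }

    /** Touches only the stored entries, so the cost scales with their count. */
    @Override
    Iterator<VectorEntry> iterateNonDefault() {
        final Iterator<Map.Entry<Integer, Double>> it = entries.entrySet().iterator();
        return new Iterator<VectorEntry>() {
            public boolean hasNext() { return it.hasNext(); }
            public VectorEntry next() {
                Map.Entry<Integer, Double> e = it.next();
                return new VectorEntry(e.getKey(), e.getValue());
            }
            public void remove() { throw new UnsupportedOperationException(); }
        };
    }
}
```

With this shape, a new implementation only has to supply getDimension() and
getEntry(), plus iterateNonDefault() if it is sparse; everything written
against the iterators in the abstract class comes for free.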


> The generic method should also either be provided in two versions (all
> entries and non-default entries) or it should have an iterator argument. For
> example the cosine and exponential functions transform a zero entry into a
> non-zero entry so they cannot ignore zero entries.
>
> >
> >   * while we're at it, if there is map(), why not also double
> > RealVector.collect(Collector()), where Collector defines void
> > collect(int index, double value); and double result(); - this can be used
> > for generic inner products and kernels (and can allow for consolidating
> > all of the L1Norm(), norm(), and LInfNorm() methods into this same method,
> > passing in different L1NormCollector() etc... instances).
>
> Good idea too. Another JIRA ticket for that?
>

JIRA ticket, tests, patch on the way.  Maybe today, we'll see. :)
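
For reference, a minimal sketch of the Collector shape described above. The
Collector interface follows the two methods named in the thread; the concrete
collector classes and the CollectSketch helper are hypothetical, and a plain
double[] stands in for the vector so the example is self-contained.

```java
// Hypothetical sketch; none of these types exist in commons-math.
interface Collector {
    void collect(int index, double value);
    double result();
}

/** Sum of absolute values. */
final class L1NormCollector implements Collector {
    private double sum;
    public void collect(int index, double value) { sum += Math.abs(value); }
    public double result() { return sum; }
}

/** Euclidean norm: square root of the sum of squares. */
final class L2NormCollector implements Collector {
    private double sumOfSquares;
    public void collect(int index, double value) { sumOfSquares += value * value; }
    public double result() { return Math.sqrt(sumOfSquares); }
}

/** Largest absolute value. */
final class LInfNormCollector implements Collector {
    private double max;
    public void collect(int index, double value) { max = Math.max(max, Math.abs(value)); }
    public double result() { return max; }
}

/** Inner product against a fixed array; this is where the index argument matters. */
final class DotProductCollector implements Collector {
    private final double[] other;
    private double sum;
    DotProductCollector(double[] other) { this.other = other; }
    public void collect(int index, double value) { sum += value * other[index]; }
    public double result() { return sum; }
}

final class CollectSketch {
    private CollectSketch() { }

    /** Stand-in for the proposed RealVector.collect(Collector). */
    static double collect(double[] data, Collector collector) {
        for (int i = 0; i < data.length; i++) {
            collector.collect(i, data[i]);
        }
        return collector.result();
    }
}
```

All three norms and the inner product then go through the same traversal, for
example CollectSketch.collect(data, new L1NormCollector()). Each collector
holds running state, so a fresh instance is needed per call.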


>
> >
> >   * why all the methods which are overloaded to take either RealVector
> > or double[] (getDistance, dotProduct, add, etc...) - is there really that
> > much overhead in just implementing dotProduct(double[] d)  as just
> > dotProduct(new ArrayRealVector(d, false)); - no copy is done, nothing is
> > done but one object creation...
>
> It's not the copy that could take time, but the iteration which needs to
> call getEntry(). So yes, there is some overhead and it can be avoided by
> providing the simple array version. Of course, a default implementation that
> wraps the array into an ArrayRealVector can be added to the
> AbstractRealVector class you proposed above, in order to simplify new
> implementations.
>

This depends on the implementation details: when
ArrayRealVector.dotProduct is passed another instance of ArrayRealVector,
the two have access to each other's internals, and can avoid this getEntry()
call altogether.  Other subclasses can have similar speedup strategies.  I
can try to whip up a patch and some perf tests to check the speed of these
operations to verify - another JIRA ticket, I think? :)
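
A minimal sketch of that fast path, under stated assumptions: SimpleVector and
ArraySketchVector are hypothetical stand-ins for RealVector and
ArrayRealVector, and the constructor's boolean flag mirrors the
"wrap without copying" behaviour mentioned in the thread. It illustrates the
dispatch only and says nothing about actual performance.

```java
// Hypothetical stand-ins; not the commons-math classes.
interface SimpleVector {
    int getDimension();
    double getEntry(int index);
    double dotProduct(SimpleVector other);
    double dotProduct(double[] other);
}

final class ArraySketchVector implements SimpleVector {
    private final double[] data;

    /** When copy is false the array is wrapped as-is, so no copy is made. */
    ArraySketchVector(double[] d, boolean copy) {
        this.data = copy ? d.clone() : d;
    }

    public int getDimension() {
        return data.length;
    }

    public double getEntry(int index) {
        return data[index];
    }

    public double dotProduct(SimpleVector other) {
        if (other.getDimension() != data.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        if (other instanceof ArraySketchVector) {
            // Fast path: same class, so read the other backing array directly.
            return dot(data, ((ArraySketchVector) other).data);
        }
        // Generic path: one getEntry() call per element.
        double sum = 0.0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i] * other.getEntry(i);
        }
        return sum;
    }

    /** The double[] overload reduces to one wrapper allocation plus the fast path. */
    public double dotProduct(double[] other) {
        return dotProduct(new ArraySketchVector(other, false));
    }

    private static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```

Whether the wrapper allocation is measurable next to a hand-written double[]
loop is exactly what the perf tests would need to show.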


> >   * SparseVector is just a marker interface?  Does it serve any purpose?
>
> For now, yes it is a marker interface. There was some discussion about
> these interfaces just before the release of 2.0. The conclusion was that
> they should remain simple markers at that time.
>

Fair enough.


> The idea was really that people could provide their own implementations.
> Some methods that are close in spirit to the iterators you ask for are in
> the matrix interfaces (the walkXxx methods) and are used in many algorithms
> inside [math].
>

Ok great, I'll try to play around with those.


> If you intend to contribute them to [math], you'll have to put them on JIRA
> and send a Software Grant <http://www.apache.org/licenses/#grants> to the
> Apache secretary. If you develop contributions directly for [math] (i.e. if
> it is not preexisting software), then rather than a Software Grant we will
> need a Contributor License Agreement (CLA), either an Individual CLA
> or a Corporate CLA <http://www.apache.org/licenses/#clas>.
>

Yeah, I'm down with the "apache way", I'll attach patches to the JIRA
tickets after clicking the lovely "you can have this" button.  None of the
stuff I'm talking about contributing is a "large body of code" which needs a
special grant (I'm sending Mahout a bunch of stuff which may need that,
although I'm the only contributor to the project I'm donating, so I'm not
sure it's needed even in that case).

    -jake
