Re: [math] Questions about the linear package

luc . maisonobe Wed, 14 Oct 2009 03:02:19 -0700

----- "Jake Mannix" <jake.man...@gmail.com> wrote:

> Greetings, commons-math!
> 
>   I've been looking at a variety of apache/bsd-licensed linear
> libraries for
> use in massively parallel machine-learning applications I've been
> working on
> (I am housing my own open-source library at
> http://decomposer.googlecode.com,
> and am looking at integrating with/using/contributing to Apache
> Mahout), and
> I'm wondering a little about the linear API there is here in
> commons-math:
> 
>   * also for RealVector - No iterator methods?  So if the
> implementation is
> sparse, there's no way to just iterate over the non-zero entries? 
> What's
> worse, you can't even subclass OpenMapVector and expose the iterator
> on the
> OpenIntToDoubleHashMap inner object, because it's private. :\


Good idea. You can use JIRA <https://issues.apache.org/jira/browse/MATH> to 
register a request for implementing this. Patches are of course welcome.
There should probably be two iterators: one for all entries and one for the 
non-default entries (which may be non-zeroes or non-NaN or anything else).
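
As an illustration only, here is a minimal sketch of the two-iterator idea.
VectorEntry, SketchSparseVector and nonDefaultIterator() are hypothetical names
made up for this example, not existing [math] classes, and a TreeMap stands in
for the OpenIntToDoubleHashMap storage:

```java
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.TreeMap;

/** Hypothetical (index, value) pair; not a commons-math class. */
final class VectorEntry {
    final int index;
    final double value;

    VectorEntry(int index, double value) {
        this.index = index;
        this.value = value;
    }
}

/** Hypothetical sparse vector storing only entries that differ from a default value. */
final class SketchSparseVector {
    private final int dimension;
    private final double defaultValue;
    private final TreeMap<Integer, Double> stored = new TreeMap<Integer, Double>();

    SketchSparseVector(int dimension, double defaultValue) {
        this.dimension = dimension;
        this.defaultValue = defaultValue;
    }

    void setEntry(int index, double value) {
        if (index < 0 || index >= dimension) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        // Double.compare treats NaN as equal to NaN, so a NaN default works too.
        if (Double.compare(value, defaultValue) == 0) {
            stored.remove(index);
        } else {
            stored.put(index, value);
        }
    }

    double getEntry(int index) {
        Double v = stored.get(index);
        return v == null ? defaultValue : v;
    }

    /** Visits every index from 0 to dimension - 1, default entries included. */
    Iterator<VectorEntry> iterator() {
        return new Iterator<VectorEntry>() {
            private int cursor = 0;

            public boolean hasNext() {
                return cursor < dimension;
            }

            public VectorEntry next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                VectorEntry e = new VectorEntry(cursor, getEntry(cursor));
                cursor++;
                return e;
            }

            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }

    /** Visits only the stored (non-default) entries, in increasing index order. */
    Iterator<VectorEntry> nonDefaultIterator() {
        final Iterator<Map.Entry<Integer, Double>> it = stored.entrySet().iterator();
        return new Iterator<VectorEntry>() {
            public boolean hasNext() {
                return it.hasNext();
            }

            public VectorEntry next() {
                Map.Entry<Integer, Double> e = it.next();
                return new VectorEntry(e.getKey(), e.getValue());
            }

            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }
}
```

With such a pair, a sparse dot product could walk nonDefaultIterator() only,
while operations that must touch every entry would use iterator().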

> 
>   * for RealVector - what's with the million-different methods
> mapXXX(),
> mapXXXtoSelf()?  Why not just map(UnaryFunction()), and
> mapToSelf(UnaryFunction()), where UnaryFunction defines the single
> method
> double apply(double d); ?  Any user who wishes to implement RealVector
> (to
> say, make a more efficient specialized SparseVector) has to go through
> the
> pain of writing up a million methods dealing with these (and even if
> copy/paste gets most of this,  it still leads to some horribly huge
> .java
> files filled with junk that does not appear to be used).  There does
> not
> even appear to be an AbstractRealVector which implements all of these
> for
> you (by using the above-mentioned iterator() ).

This API is set up the way I received it from an external contributor, so I
guess he had a use case for it. I extended it in the same spirit and ended up
with this huge mess. I'm sorry for that. I agree a more generic method would be
interesting. Removing these methods would however introduce an incompatible API
change, so this could be done only in a major release (i.e. 3.0), which is
probably a long time from now.

The generic method should also either be provided in two versions (all entries 
and non-default entries) or it should have an iterator argument. For example 
the cosine and exponential functions transform a zero entry into a non-zero 
entry so they cannot ignore zero entries.
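
To make the zero-entry problem concrete, here is a rough sketch on plain
arrays. UnaryFunction is the interface proposed in the quoted message;
MapSketch, mapAll() and mapNonZero() are hypothetical names used only for this
example and do not exist in [math]:

```java
/** Hypothetical single-method function type, as proposed in the quoted message. */
interface UnaryFunction {
    double apply(double d);
}

final class MapSketch {
    private MapSketch() {
    }

    /** Applies f to every entry, zeros included. Correct for any f. */
    static double[] mapAll(double[] v, UnaryFunction f) {
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = f.apply(v[i]);
        }
        return out;
    }

    /** Applies f to non-zero entries only. Correct only when f(0) == 0. */
    static double[] mapNonZero(double[] v, UnaryFunction f) {
        double[] out = new double[v.length]; // skipped entries stay at 0.0
        for (int i = 0; i < v.length; i++) {
            if (v[i] != 0.0) {
                out[i] = f.apply(v[i]);
            }
        }
        return out;
    }
}
```

For a function such as x -> 2 * x both versions agree, so the cheaper
mapNonZero() is safe. For Math.cos they differ: mapAll() turns a zero entry
into 1.0, while mapNonZero() wrongly leaves it at 0.0.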

> 
>   * while we're at it, if there is map(), why not also double
> RealVector.collect(Collector()), where Collector defines void
> collect(int
> index, double value); and double result(); - this can be used for
> generic
> inner products and kernels (and can allow for consolidating all of
> the
> L1Norm(), norm(), and LInfNorm() methods into this same method,
> passing in
> different L1NormCollector() etc... instances).

Good idea too. Another JIRA ticket for that?
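
For the ticket, the proposal could be sketched roughly as follows. The
Collector interface follows the signatures given in the quoted message; the
concrete collector classes and the static collect() helper working on a plain
array are assumptions made for this example only:

```java
/** Collector as described in the quoted message: visit entries, then report a result. */
interface Collector {
    void collect(int index, double value);

    double result();
}

/** Sum of absolute values. */
final class L1NormCollector implements Collector {
    private double sum;

    public void collect(int index, double value) {
        sum += Math.abs(value);
    }

    public double result() {
        return sum;
    }
}

/** Square root of the sum of squares. */
final class L2NormCollector implements Collector {
    private double sumOfSquares;

    public void collect(int index, double value) {
        sumOfSquares += value * value;
    }

    public double result() {
        return Math.sqrt(sumOfSquares);
    }
}

/** Largest absolute value. */
final class LInfNormCollector implements Collector {
    private double max;

    public void collect(int index, double value) {
        max = Math.max(max, Math.abs(value));
    }

    public double result() {
        return max;
    }
}

/** Inner product with a fixed array; uses the index to look up the other operand. */
final class DotProductCollector implements Collector {
    private final double[] other;
    private double sum;

    DotProductCollector(double[] other) {
        this.other = other;
    }

    public void collect(int index, double value) {
        sum += value * other[index];
    }

    public double result() {
        return sum;
    }
}

final class CollectSketch {
    private CollectSketch() {
    }

    /** Hypothetical stand-in for RealVector.collect(Collector), on a plain array. */
    static double collect(double[] v, Collector c) {
        for (int i = 0; i < v.length; i++) {
            c.collect(i, v[i]);
        }
        return c.result();
    }
}
```

All four collectors above give the same result whether or not zero entries are
visited, so a sparse implementation could feed them from a non-default iterator.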

> 
>   * why all the methods which are overloaded to take either RealVector
> or
> double[] (getDistance, dotProduct, add, etc...) - is there really that
> much
> overhead in just implementing dotProduct(double[] d)  as just
> dotProduct(new
> ArrayRealVector(d, false)); - no copy is done, nothing is done but
> one
> object creation...

It's not the copy that could take time, but the iteration which needs to call 
getEntry(). So yes, there is some overhead and it can be avoided by providing 
the simple array version. Of course, a default implementation that wraps the 
array into an ArrayRealVector can be added to the AbstractRealVector class you 
proposed above, in order to simplify new implementations.
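
A small sketch of that arrangement, with made-up names (SketchVector,
AbstractSketchVector and ArraySketchVector are hypothetical stand-ins for
RealVector, the proposed AbstractRealVector and ArrayRealVector, reduced to
what the example needs):

```java
/** Hypothetical, heavily reduced stand-in for the RealVector interface. */
interface SketchVector {
    int getDimension();

    double getEntry(int index);

    double dotProduct(SketchVector other);

    double dotProduct(double[] other);
}

/** Hypothetical base class providing generic defaults built on getEntry(). */
abstract class AbstractSketchVector implements SketchVector {
    public double dotProduct(SketchVector other) {
        if (other.getDimension() != getDimension()) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < getDimension(); i++) {
            sum += getEntry(i) * other.getEntry(i); // one getEntry() call per operand
        }
        return sum;
    }

    /** Default array version: wrap without copying and delegate. */
    public double dotProduct(double[] other) {
        return dotProduct(new ArraySketchVector(other));
    }
}

/** Array-backed vector that overrides the default to read the array directly. */
final class ArraySketchVector extends AbstractSketchVector {
    private final double[] data;

    ArraySketchVector(double[] data) {
        this.data = data; // reference is kept, no copy is made
    }

    public int getDimension() {
        return data.length;
    }

    public double getEntry(int index) {
        return data[index];
    }

    @Override
    public double dotProduct(double[] other) {
        if (other.length != data.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i] * other[i]; // direct array access, no getEntry() calls
        }
        return sum;
    }
}
```

A new implementation then only has to supply getDimension() and getEntry() to
get a working dotProduct(double[]), and can override it later if the getEntry()
overhead matters.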

> 
>   * SparseVector is just a marker interface?  Does it serve any
> purpose?

For now, yes it is a marker interface. There was some discussion about these 
interfaces just before the release of 2.0. The conclusion was that they should 
remain simple markers at that time.

> 
> I guess I could ask similar questions on the Matrix interfaces, but
> maybe
> those will probably be cleared up by understanding the philosophy
> behind the
> Vector interfaces.
> 
> I'd love to use commons-math for parts of my projects in which the
> entire
> data sets can live in memory (often part of the computation falls into
> this
> category, even if it's not the most meaty part, it's big enough that
> I'll
> kill my performance if I am stuck writing my own subroutines for
> eigen
> computation, etc for many moderately small matrices), but converting
> to and
> from the commons-math linear interfaces seems a bit unwieldy.  Maybe it
> would
> be easier if I could understand why these are the way they are.

The idea was really that people could provide their own implementations. Some 
methods that are close in spirit to the iterators you ask for are in the matrix 
interfaces (the walkXxx methods) and are used in many algorithms inside [math].

> 
> I'm happy to contribute patches consolidating interfaces and/or
> extending

Fine. We are always happy to see a community growing around our components.

> functionality (you seem to be missing a compact int/double pair
> implementation of sparse vectors, for example, which are a
> fantastically
> performant format if they're immutable and only being used for dot
> products
> and adding them to dense vectors), if it would be of help (I'm
> tracking my
> attempts at this over on my GitHub clone of trunk:
> http://github.com/jakemannix/commons-math ).

If you intend to contribute them to [math], you'll have to put them on JIRA and 
send a Software Grant <http://www.apache.org/licenses/#grants> to the Apache 
secretary. If you develop contributions directly for [math] (i.e. if it is not 
preexisting software), then rather than a Software Grant we will need a 
Contributor License Agreement (CLA), either an Individual CLA or a Corporate 
CLA <http://www.apache.org/licenses/#clas>.

Thanks
Luc

> 
>   -jake mannix
>   Principal Software Engineer
>   Search and Recommender Systems
>   LinkedIn.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org
