Re: [statistics][descriptive] Classes or static methods for common descriptive statistics?
Hello.

On Tue, 28 May 2019 at 20:36, Alex Herbert wrote:
>
> > On 28 May 2019, at 18:09, Eric Barnhill wrote:
> >
> > The previous commons-math interface for descriptive statistics used a
> > paradigm of constructing classes for various statistical functions and
> > calling evaluate(). Example:
> >
> > Mean mean = new Mean();
> > double mn = mean.evaluate(double[])
> >
> > I wrote this type of code all through grad school and always found it
> > unnecessarily bulky. To me these summary statistics are classic use cases
> > for static methods:
> >
> > double mean = Mean.evaluate(double[])
> >
> > I don't have any particular problem with the evaluate() syntax.
> >
> > I looked over the old Math 4 API to see if there were any benefits to the
> > previous class-oriented approach that we might not want to lose. But I
> > don't think there were; the functionality outside of evaluate() is minimal.
>
> A quick check shows that evaluate comes from UnivariateStatistic. This has
> some more methods that add little to an instance view of the computation:
>
> double evaluate(double[] values) throws MathIllegalArgumentException;
> double evaluate(double[] values, int begin, int length) throws
>     MathIllegalArgumentException;
> UnivariateStatistic copy();
>
> However it is extended by StorelessUnivariateStatistic which adds methods
> to update the statistic:
>
> void increment(double d);
> void incrementAll(double[] values) throws MathIllegalArgumentException;
> void incrementAll(double[] values, int start, int length) throws
>     MathIllegalArgumentException;
> double getResult();
> long getN();
> void clear();
> StorelessUnivariateStatistic copy();
>
> This type of functionality would be lost by static methods.
>
> If you are moving to a functional interface type pattern for each statistic
> then you will lose the other functionality possible with an instance state,
> namely updating with more values or combining instances.
> So this is a question of whether updating a statistic is required after the
> first computation.
>
> Will there be an alternative in the library for a map-reduce type operation
> using instances that can be combined using Stream.collect:
>
> R collect(Supplier supplier,
>           ObjDoubleConsumer accumulator,
>           BiConsumer combiner);
>
> Here would be Mean:
>
> double mean = Arrays.stream(new double[1000]).collect(Mean::new, Mean::add,
>     Mean::add).getMean()
>
> with:
>
> void add(double);
> void add(Mean);
> double getMean();
>
> (Untested code)
>
> > Finally we should consider whether we really need a separate class for
> > each statistic at all. Do we want to call:
> >
> > Mean.evaluate()
> >
> > or
> >
> > SummaryStats.mean()
> >
> > or maybe
> >
> > Stats.mean() ?
> >
> > The last being nice and compact.
> >
> > Let's make a decision so our esteemed mentee Virendra knows in what
> > direction to take his work this summer. :)

I'm not sure I understand the implicit conclusions of this conversation and
the other one there:
https://markmail.org/message/7dmyhzuy6lublyb5

Do we agree that the core issue is *not* how to compute a mean, or a median,
or a fourth moment, but how any and all of those can be computed seamlessly
through a functional API (stream)?

As Alex pointed out, a useful functionality is the ability to "combine"
instances, e.g. if data are collected by several threads.
A potential use-case is the retrieval of the current value of (any)
statistical quantities while the data continues to be collected.

An initial idea would be:

public interface StatQuantity {
    public double value(double[]); // For "basic" usage.
    public double value(DoubleStream); // For "advanced" usage.
}

public class StatCollection {
    /** Specify which quantities this collection will hold/compute. */
    public StatCollection(Map stats) { /* ... */ }

    /**
     * Start a worker thread.
     *
     * @param data Values for which the stat quantities must be computed.
     */
    public void startCollector(DoubleStream data) { /* ... */ }

    /** Combine current state of workers. */
    public void collect() { /* ... */ }

    /** @return the current (combined) value of a named quantity. */
    public double get(String name) { /* ... */ }

    private class StatCollector implements Callable {
        StatCollector(DoubleStream data) { /* ... */ }
    }
}

This is all totally untested, very partial, and probably wrong-headed, but I
thought that we were looking at this kind of refactoring.

Regards,
Gilles

-
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org
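Alex marks his combinable-Mean snippet above as untested. For concreteness,
here is one way it could be filled in so that it works with Stream.collect;
the class shape (add(double), add(Mean), getMean()) is from his message, but
the incremental-update and combine formulas are my own guesses, not anything
agreed in the thread:

```java
import java.util.Arrays;

/** Sketch of a combinable Mean accumulator usable with DoubleStream.collect. */
public class Mean {
    private long n;
    private double mean;

    /** Accumulator: add a single value using an incremental update. */
    void add(double x) {
        n++;
        mean += (x - mean) / n; // avoids overflow of a plain running sum
    }

    /** Combiner: merge another instance, e.g. from a parallel stream. */
    void add(Mean other) {
        long total = n + other.n;
        if (total != 0) {
            mean = (n * mean + other.n * other.mean) / total;
        }
        n = total;
    }

    double getMean() {
        return mean;
    }

    public static void main(String[] args) {
        double[] data = {1, 2, 3, 4};
        // Mean::new is the supplier, Mean::add(double) the accumulator,
        // Mean::add(Mean) the combiner.
        double mean = Arrays.stream(data)
                            .collect(Mean::new, Mean::add, Mean::add)
                            .getMean();
        System.out.println(mean); // 2.5
    }
}
```

The two add overloads resolve cleanly because each parameter position of
collect has its own functional-interface target type.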
Re: [lang] Giant StringUtils vs. NullSafeStrings.
On Tue, May 28, 2019 at 5:04 PM Mark Dacek wrote:
> I'm a bit curious on the desire to split it out. I'm not hard opposed but
> also don't know that it would save much time or clarify things for most. I
> wouldn't want to say that this is a critical reason for keeping things as
> they are, but I'd imagine that your typical dev doesn't use StringUtils for
> abbreviate or getLevenshteinDistance style-features nearly as often as they
> do for length and indexOf to avoid onerous unit tests.

The immediate motivation is that StringUtils does not have a null-safe
version of String#getBytes(String). It does have null-safe versions of a
_few_ methods for Strings and CharSequences. Instead of making StringUtils
even larger and more unwieldy, I was thinking that a class more focused on
"I want to call String methods in a null-safe way", as opposed to "I want to
perform non-JRE-standard operations on Strings", would be clearer.

The obvious drawback is that you now have two classes to mine for
functionality instead of one: typing StringUtils in your IDE will not
present the new methods in NullSafeStrings, so you'd have to remember/know
what belongs where; IOW you'd have to know what belongs in String vs. not.

It was just an idea. For now, I am happy to just add a null-safe
String#getBytes(String) to StringUtils.

Gary

> On Tue, May 28, 2019 at 7:53 AM Gary Gregory wrote:
> >
> > Hi All:
> >
> > Right now we have a giant class called StringUtils. I have code in my
> > own library that has at least one null-safe API for Strings. For
> > example, a String.getBytes(String, Charset) that returns a null byte[]
> > if the input String is null.
> >
> > I'd like to propose a new class called NullSafeStrings, that covers all
> > String APIs (there aren't that many) for null-safe String input. If some
> > of these APIs are already in StringUtils, those would be deprecated and
> > point to NullSafeStrings.
> >
> > Note that I am not using the "Utils" postfix on purpose, since I find it
> > meaningless and the JRE now uses the plural form for this kind of code;
> > see Files and Paths.
> >
> > Thoughts?
> >
> > Gary
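For illustration, a hypothetical NullSafeStrings with the null-safe getBytes
discussed above could be as small as this; the class and method names follow
the proposal in the thread, and the Charset overload is used so the sketch
avoids the checked UnsupportedEncodingException of the String-named variant:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of the proposed NullSafeStrings class. */
public class NullSafeStrings {

    /**
     * Null-safe String#getBytes(Charset): returns null for null input
     * instead of throwing a NullPointerException.
     */
    public static byte[] getBytes(String s, Charset charset) {
        return s == null ? null : s.getBytes(charset);
    }

    public static void main(String[] args) {
        System.out.println(getBytes(null, StandardCharsets.UTF_8)); // null
        System.out.println(getBytes("abc", StandardCharsets.UTF_8).length); // 3
    }
}
```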
Re: svn commit: r1859859 [1/5] - in /commons/proper/jcs/trunk/commons-jcs-core/src: main/java/org/apache/commons/jcs/ main/java/org/apache/commons/jcs/access/ main/java/org/apache/commons/jcs/admin/ m
Thanks, I've reverted SVN back to r1852304, which corresponds to git
ad897014842fc830483f32fdfb903f3bb8f70289, i.e. immediately after the
conversion to Git. Also moved SVN to _moved_to_git.

On Wed, 29 May 2019 at 07:02, Thomas Vandahl wrote:
> On 27.05.19 10:21, sebb wrote:
> > Either re-apply each of the commits in turn to Git.
> Done.
>
> Bye, Thomas
[ALL] Issues with Git repos: bad HEADs and unexpected refs - need Git guru
I noticed that JCS still has a trunk branch and that appears to be the
default, so I did a check of all the Commons git repos.

The command used is:

  git ls-remote #{repo} HEAD master trunk

where repo = https://gitbox.apache.org/repos/asf/{commons-???}.git

I then compared the hashes for HEAD and refs/heads/master. If different,
then the HEAD is not master. Some of the repos have references other than
HEAD, refs/heads/master or refs/remotes/trunk (which I assume are OK?).

The discrepancies are listed below. The most important ones are obviously
the ones with bad heads - I guess we should ask Infra to reset those. But
what if changes have been made to trunk? I don't know what to do about the
additional references, if anything.

Unexpected HEADs:

https://gitbox.apache.org/repos/asf/commons-configuration.git
61732d3c65cbfae58d71e2dd55caf8760a9aa642 HEAD **not master**
61732d3c65cbfae58d71e2dd55caf8760a9aa642 refs/heads/trunk REF?
61732d3c65cbfae58d71e2dd55caf8760a9aa642 refs/remotes/trunk
96720ced6f263462aaae7217392399267b1d141f refs/heads/master

https://gitbox.apache.org/repos/asf/commons-jcs.git
ad897014842fc830483f32fdfb903f3bb8f70289 HEAD **not master**
ad897014842fc830483f32fdfb903f3bb8f70289 refs/heads/trunk REF?
ad897014842fc830483f32fdfb903f3bb8f70289 refs/remotes/trunk
b59d43e64e3759dac71b4b68bf43f3ac32d484c3 refs/heads/master

Unexpected REFS (HEAD looks OK):

https://gitbox.apache.org/repos/asf/commons-weaver.git
141e820c70fa3a86d49b740076b11ba50dc8b456 refs/remotes/origin/master REF?
9329b1486e28b4d7112dc30ba381892dccb924db refs/remotes/trunk
a88dbbbd76aed4ac98d94ceb2e1bc9bd2183aa84 HEAD
a88dbbbd76aed4ac98d94ceb2e1bc9bd2183aa84 refs/heads/master

https://gitbox.apache.org/repos/asf/commons-logging.git
0548efba5be8c7dd04f71d81e642488fec6f5472 HEAD
0548efba5be8c7dd04f71d81e642488fec6f5472 refs/heads/master
0548efba5be8c7dd04f71d81e642488fec6f5472 refs/heads/trunk REF?
0548efba5be8c7dd04f71d81e642488fec6f5472 refs/remotes/trunk
e7f1bbad19cbd1f4ed3085b83d2866c9bc247815 refs/original/refs/remotes/trunk REF?

https://gitbox.apache.org/repos/asf/commons-codec.git
163d643d1176e0dc9334ee83e21b9ce21d24fc1a refs/remotes/trunk
3ee99ea6cabeee96d5dfbb45e43b1e8cf74a9929 refs/original/refs/heads/trunk REF?
3ee99ea6cabeee96d5dfbb45e43b1e8cf74a9929 refs/original/refs/remotes/trunk REF?
48b615756d1d770091ea3322eefc08011ee8b113 HEAD
48b615756d1d770091ea3322eefc08011ee8b113 refs/heads/master

https://gitbox.apache.org/repos/asf/commons-proxy.git
838024d7d8193d181506568fec9e681b6e11e59a HEAD
838024d7d8193d181506568fec9e681b6e11e59a refs/heads/master
838024d7d8193d181506568fec9e681b6e11e59a refs/heads/trunk REF?
838024d7d8193d181506568fec9e681b6e11e59a refs/remotes/trunk
Re: [statistics][descriptive] Classes or static methods for common descriptive statistics?
At the end of the day, like we just saw on the user list today, users are
going to come around with arrays and want to get the mean, median, variance,
or quantiles of that array. The easiest way to do this is to have some sort
of static method that delivers these:

double mean = Stats.mean(double[] data)

and the user doesn't have to think more than that. Yes, this should be
implemented functionally, although in this simple case we probably just need
to call Java's SummaryStats() under the hood. If we overcomplicate this,
again like we just saw on the user list, users will simply not use the code.

Then yes, I agree Alex's argument for updateable instances containing state
is compelling. How to relate these more complicated instances with the
simple cases is a great design question.

But first, let's nail the Matlab/Numpy case of just having an array of
doubles and wanting the mean / median. I am just speaking of my own use
cases here but I used exactly this functionality all the time:

Mean m = new Mean();
double mean = m.evaluate(data)

and I think this should be the central use case for the new module.

On Wed, May 29, 2019 at 4:51 AM Gilles Sadowski wrote:
> Hello.
>
> Do we agree that the core issue is *not* how to compute a mean, or a
> median, or a fourth moment, but how any and all of those can be
> computed seamlessly through a functional API (stream)?
>
> As Alex pointed out, a useful functionality is the ability to "combine"
> instances, e.g. if data are collected by several threads.
Re: [GSoC] Thursday mentee meeting
Hi All,

Would anyone like to have another online meeting as per previous weeks?
5PM UTC.

Alex
[gsoc] Weekly meeting tomorrow
Let's have another weekly gathering tomorrow for GSoC mentees at the usual
time.

Everyone should have written at least some code, and a unit test that goes
with that code, and submitted it for review via a PR. If you have
difficulties doing this, please raise questions on the Slack.

Thanks to the contributors who are helping out so much at these meetings.
[codec] Next Commons Codec Release Date
Hi,

Just wondering when the next planned release date of Commons Codec will be?

Thanks,
Jack.

Sent from Mail for Windows 10
Re: [statistics][descriptive] Classes or static methods for common descriptive statistics?
On 29/05/2019 12:50, Gilles Sadowski wrote:
> I'm not sure I understand the implicit conclusions of this conversation
> and the other one there:
> https://markmail.org/message/7dmyhzuy6lublyb5
>
> Do we agree that the core issue is *not* how to compute a mean, or a
> median, or a fourth moment, but how any and all of those can be
> computed seamlessly through a functional API (stream)?
>
> As Alex pointed out, a useful functionality is the ability to "combine"
> instances, e.g. if data are collected by several threads.
> A potential use-case is the retrieval of the current value of (any)
> statistical quantities while the data continues to be collected.
>
> This is all totally untested, very partial, and probably wrong-headed,
> but I thought that we were looking at this kind of refactoring.

I don't think you can pass in a Stream to be worked on. The Stream API
requires that you pass something into the stream, and the stream contents
are changed (intermediate operation) or consumed (terminating operation).
Only when a terminating operation is invoked is the stream pipeline
activated. So the new classes have to be usable in intermediate and
terminating operations.

If the idea of the refactoring was to move all the old API to a new API
that can be used with streams, then each Statistic should be based on the
ideas presented in:

java.util.DoubleSummaryStatistics
java.util.IntSummaryStatistics
java
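The JDK class Alex names here already demonstrates the pattern under
discussion: it is usable both as a terminal operation on a stream and as an
incrementally updatable, combinable instance.

```java
import java.util.DoubleSummaryStatistics;
import java.util.stream.DoubleStream;

/** Demo of the JDK precedent for a stream-friendly, combinable statistic. */
public class SummaryStatsDemo {
    public static void main(String[] args) {
        // Used as a terminal collector on a DoubleStream...
        DoubleSummaryStatistics stats = DoubleStream.of(1, 2, 3, 4)
                                                    .summaryStatistics();
        System.out.println(stats.getAverage()); // 2.5

        // ...and as an instance updated value-by-value, then combined,
        // e.g. when data arrives on several threads.
        DoubleSummaryStatistics a = new DoubleSummaryStatistics();
        a.accept(1);
        a.accept(2);
        DoubleSummaryStatistics b = new DoubleSummaryStatistics();
        b.accept(3);
        b.accept(4);
        a.combine(b);
        System.out.println(a.getAverage()); // 2.5
    }
}
```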
Re: [statistics][descriptive] Classes or static methods for common descriptive statistics?
> On 29 May 2019, at 21:57, Eric Barnhill wrote:
>
> At the end of the day, like we just saw on the user list today, users are
> going to come around with arrays and want to get the mean, median,
> variance, or quantiles of that array. The easiest way to do this is to
> have some sort of static method that delivers these:
>
> double mean = Stats.mean(double[] data)

This Stats class can be just a utility class with static helper methods
invoking the appropriate class implementation. All the algorithms should be
in one place (to minimise code duplication).

I don't think calling SummaryStats under the hood is the best solution for
these helper methods. It does a lot more work than is necessary to compute
one metric. It should be done with individual classes for each metric and an
appropriate helper method for each. Looking at math4 this would be helpers
for:

moment/FirstMoment.java
moment/FourthMoment.java
moment/GeometricMean.java
moment/Kurtosis.java
moment/Mean.java
moment/SecondMoment.java
moment/SemiVariance.java
moment/Skewness.java
moment/StandardDeviation.java
moment/ThirdMoment.java
moment/Variance.java
rank/Max.java
rank/Median.java
rank/Min.java
rank/Percentile.java
summary/Product.java
summary/Sum.java
summary/SumOfLogs.java
summary/SumOfSquares.java
DescriptiveStatistics.java (mean, variance, StdDev, Max, Min, Count, Sum,
    Skewness, Kurtosis, Percentile)
SummaryStatistics.java (mean, variance, StdDev, Max, Min, Count, Sum)

Left out those operating on a double[] for each increment (not a single
double):

moment/VectorialCovariance.java
moment/VectorialMean.java
MultivariateSummaryStatistics.java

Left out this as it is an approximation for when the entire double[] cannot
be held in memory:

rank/PSquarePercentile.java

Note that some metrics are not applicable to undefined data lengths and so
cannot be written to support streams:

Median

> and the user doesn't have to think more than that. Yes, this should be
> implemented functionally, although in this simple case we probably just
> need to call Java's SummaryStats() under the hood. If we overcomplicate
> this, again like we just saw on the user list, users will simply not use
> the code.
>
> Then yes, I agree Alex's argument for updateable instances containing
> state is compelling. How to relate these more complicated instances with
> the simple cases is a great design question.
>
> But first, let's nail the Matlab/Numpy case of just having an array of
> doubles and wanting the mean / median. I am just speaking of my own use
> cases here but I used exactly this functionality all the time:
>
> Mean m = new Mean();
> double mean = m.evaluate(data)
>
> and I think this should be the central use case for the new module.
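A minimal sketch of the layering Alex describes — a static Stats facade
delegating to a per-metric implementation class, so the algorithm lives in
one place — might look like this. All names and method bodies here are
illustrative guesses, not the eventual [statistics] API:

```java
/** Sketch: a static facade over per-metric implementation classes. */
final class Stats {
    private Stats() {}

    /** Static helper delegating to the Mean implementation class. */
    static double mean(double[] data) {
        Mean m = new Mean();
        for (double d : data) {
            m.add(d);
        }
        return m.getMean();
    }

    /** Per-metric class holding the single algorithm implementation. */
    static final class Mean {
        private long n;
        private double mean;

        void add(double x) {
            n++;
            mean += (x - mean) / n; // incremental update
        }

        double getMean() {
            return mean;
        }
    }

    public static void main(String[] args) {
        // The "Matlab/Numpy" one-liner Eric asks for:
        System.out.println(mean(new double[] {1, 2, 3})); // 2.0
    }
}
```

Users who only want the one-liner call Stats.mean; users who need updatable
or combinable state use the metric class directly.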
Re: [codec] Next Commons Codec Release Date
Hi Jack,

There are no current release plans. There are a couple of other components
I'd like to release as well, so it is good that you are registering
interest. I suggest you go through JIRA and the PRs on GitHub and see if
there is any help you can provide with outstanding issues. This would make
an upcoming release even more valuable.

Thank you,
Gary

On Wed, May 29, 2019 at 6:12 PM Tim Fake Doyle wrote:
> Hi,
>
> Just wondering when the next planned release date of commons codec will
> be?
>
> Thanks,
> Jack.
>
> Sent from Mail for Windows 10