Re: Standards for mail archive statistics gathering?

Steve Blackmon Wed, 06 May 2015 09:49:51 -0700

> For visualization, for sure, json is the current natural format when data is
> consumed from the browser.
> I don't have great experience on this, and what I'm missing with json
> currently is a common practice on documenting a structure: are there common
> practices?


In podling Streams [0], we make extensive use of JSON Schema [1], from
which we generate POJOs with the Maven plugin jsonschema2pojo [2]. That
makes manipulating the objects in Java/Scala pleasant. I expect other
languages have similar JSON Schema-based ORM paradigms as well. This
pattern supports inheritance both within and across projects; for
example, see how [3] extends [4], which extends [5]. These schemas are
relatively self-documenting, and because they are themselves JSON
documents, generating documentation or other artifacts from them is
straightforward.
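
As a rough illustration of the inheritance and self-documenting points,
here is a minimal Python sketch. The three schemas are simplified,
hypothetical stand-ins for [3], [4], and [5]: the property names and
descriptions are made up for the example, not copied from the real
files. resolve() and describe() are illustrative helpers, not part of
jsonschema2pojo or Streams, and the "extends" keyword is assumed to
follow the JSON Schema draft-03 convention.

```python
# Hypothetical, simplified schemas keyed by file name (assumed content,
# loosely modelled on the committee -> group -> object chain).
SCHEMAS = {
    "object.json": {
        "type": "object",
        "description": "Base activity object",
        "properties": {
            "id": {"type": "string", "description": "Unique identifier"},
            "displayName": {"type": "string", "description": "Human-readable name"},
        },
    },
    "group.json": {
        "type": "object",
        "extends": {"$ref": "object.json"},
        "description": "A group of people",
        "properties": {
            "members": {"type": "array", "description": "People in the group"},
        },
    },
    "committee.json": {
        "type": "object",
        "extends": {"$ref": "group.json"},
        "description": "A project management committee",
        "properties": {
            "chair": {"type": "string", "description": "Committee chair"},
        },
    },
}


def resolve(name, schemas=SCHEMAS):
    """Return a schema's properties merged with those of its ancestors."""
    schema = schemas[name]
    merged = {}
    parent = schema.get("extends")
    if parent is not None:
        # Inherited properties first, so the child can override them.
        merged.update(resolve(parent["$ref"], schemas))
    merged.update(schema.get("properties", {}))
    return merged


def describe(name, schemas=SCHEMAS):
    """Render plain-text documentation for one schema."""
    lines = [f"{name}: {schemas[name].get('description', '')}"]
    for prop, spec in sorted(resolve(name, schemas).items()):
        lines.append(
            f"  {prop} ({spec.get('type', 'any')}): {spec.get('description', '')}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    print(describe("committee.json"))
```

The same walk over "extends" is what lets a generator emit a class
hierarchy, so the documentation and the generated objects come from one
source.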

> Because for simple json structure, documentation is not really necessary, but
> once the structure goes complex, documentation is really a key requirement for
> people to use or extend. And I already see this shortcoming with the 11 json
> files from projects-new.a.o = https://projects-new.apache.org/json/foundation/

Having used these JSON documents a few weeks ago to build an Apache
community visualization [6], IMO the current crop of projects-new JSON
files are intermediate artifacts rather than a sufficiently
cross-purpose data model. That role is currently held by DOAP, mbox,
and miscellaneous others, all with some inherent shortcomings, most
notably a lack of navigability between silos.

I'd like to nominate Activity Streams [7], with community-specific
extensions (such as those roughly prototyped here: [8]), as a potential
core data model for this effort going forward, and I'm happy to help
apply some of the useful tools and connectors within podling Streams
toward that end. Converting external structured sources into normalized
documents and indexing those activities to power data-centric APIs and
visualizations are wheelhouse use cases for this project, as they say.
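
To make "converting external structured sources into normalized
documents" concrete, here is a minimal Python sketch that turns one
mail message into an Activity Streams 1.0-style document. It is not how
Streams does it: message_to_activity() and the sample message are
hypothetical, and the mapping (verb "post", a "note" object, the list
as a "group" target) is just one assumed choice among several.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

# Made-up sample message; the addresses and IDs are placeholders.
SAMPLE_MESSAGE = """\
From: Jane Doe <jane@example.org>
Date: Wed, 06 May 2015 09:49:51 -0700
Subject: Re: Example thread
Message-ID: <abc123@example.org>
List-Id: <dev.example.org>

Body text.
"""


def message_to_activity(raw_message):
    """Map one RFC 822 message to an Activity Streams 1.0-style dict."""
    msg = message_from_string(raw_message)
    name, address = parseaddr(msg["From"])
    # Keep the sender's UTC offset so time zones survive normalization.
    published = parsedate_to_datetime(msg["Date"]).isoformat()
    return {
        "id": msg["Message-ID"],
        "verb": "post",
        "published": published,
        "actor": {"objectType": "person", "id": address, "displayName": name},
        "object": {"objectType": "note", "displayName": msg["Subject"]},
        # Assumed mapping: the mailing list is the target of the post.
        "target": {"objectType": "group", "id": msg["List-Id"]},
    }


if __name__ == "__main__":
    import json

    print(json.dumps(message_to_activity(SAMPLE_MESSAGE), indent=2))
```

Once every source (mbox, DOAP, and so on) is mapped to the same
actor/verb/object shape, the actor and target IDs are what would give
the navigability between silos.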

[0] http://streams.incubator.apache.org/
[1] http://json-schema.org/documentation.html
[2] http://www.jsonschema2pojo.org/
[3] https://github.com/steveblackmon/streams-apache/blob/master/activities/src/main/jsonschema/objectTypes/committee.json
[4] https://github.com/apache/incubator-streams/blob/master/streams-pojo/src/main/jsonschema/objectTypes/group.json
[5] https://github.com/apache/incubator-streams/blob/master/streams-pojo/src/main/jsonschema/object.json
[6] http://72.182.111.65:3000/workspace/3
[7] http://activitystrea.ms/
[8] https://github.com/steveblackmon/streams-apache/blob/master/activities/src/main/jsonschema

Steve Blackmon
sblack...@apache.org

On Wed, May 6, 2015 at 2:05 AM, Hervé BOUTEMY <herve.bout...@free.fr> wrote:
> Le mardi 5 mai 2015 21:26:36 Shane Curcuru a écrit :
>> On 5/5/15 7:33 AM, Boris Baldassari wrote:
>> > Hi Folks,
>> >
>> > Sorry for the late answer on this thread. Don't know what has been done
>> > since then, but I've some experience to share on this, so here are my 2c..
>>
>> No, more input is always appreciated!  Hervé is doing some
>> centralization of the projects-new.a.o data capture, which is related
>> but slightly separate.
> +1
> this can give a common place to put code once experiments show that we should
> add a new data source
>
>> But this is going to be a long-term project
> +1
>
>> with
>> plenty of different people helping I bet.
> I hope so...
>
>>
>> ...
>>
>> > * Parsing mboxes for software repository data mining:
>> > There is a suite of tools exactly targeted at this kind of duty on
>> > github: Metrics Grimoire [1], developed (and used) by Bitergia [2]. I
>> > don't know how they manage time zones, but the toolsuite is widely used
>> > around (see [3] or [4] as examples) so I believe they are quite robust.
>> > It includes tools for data retrieval as well as visualisation.
>>
>> Drat.  Metrics Grimoire looks pretty nifty - essentially a set of
>> frameworks for extracting metadata from a bunch of sources - but it's
>> GPL, so personally I have no interest in working on it.  If someone else
>> uses it to generate datasets that's great.
>>
>> > * As for the feedback/thoughts about the architecture and formats:
>> > I love the REST-API idea proposed by Rob. That's really easy to access
>> > and retrieve through scripts on-demand. CSV and JSON are my favourite
>> > formats, because they are, again, easy to parse and widely used -- every
>> > language and library has some facility to read them natively.
>>
>> Yup - again, like project visualization, to make any of this simple for
>> newcomers to try stuff, we need to separate data gathering / model /
>> visualization.  Since most of these are spare time projects, having easy
>> chunks makes it simpler for different people to try their hand at it.
> For visualization, for sure, json is the current natural format when data is
> consumed from the browser.
> I don't have great experience on this, and what I'm missing with json
> currently is a common practice on documenting a structure: are there common
> practices?
> Because for simple json structure, documentation is not really necessary, but
> once the structure goes complex, documentation is really a key requirement for
> people to use or extend. And I already see this shortcoming with the 11 json
> files from projects-new.a.o = https://projects-new.apache.org/json/foundation/
>
> Regards,
>
> Hervé
>
>>
>> Thanks,
>>
>> - Shane
>
