On Feb 3, 2016 4:06 AM, "Tony Stevenson" <t...@pc-tony.com> wrote:
> cc += infra@
>
> Karanjeet,
>
> I am writing to you whilst wearing my Infrastructure hat.
>
> Please be careful if you are indeed recursing the entire ASF subversion
> repository (http://svn.apache.org), as you will quite likely run into
> the auto-banning service.
>
> Have you seen https://svn-dump.apache.org ? This is an entire dump of
> the SVN repo (at least the public one you are interested in). You can use
> this, and it is updated monthly. If you really need fully up-to-date data,
> you can use the dump and svnsync the remaining revisions.
>
> I guess this might be obvious, but I’ll mention it just in case. A lot of
> projects are using git repositories too, which are mirrored here:
> github.com/apache/
>
> --
> Cheers,
> Tony
>
> On behalf of the Apache Infrastructure Team
>
> -----------------------
> http://www.pc-tony.com
> GPG - 3072D/2543E323
> -----------------------
>
>
> On 3 Feb 2016, at 08:58, Karanjeet Singh <karan...@usc.edu> wrote:
> >
> > Thanks Pierre for your feedback.
> >
> > Yes, the visualization corresponds to only 133 / 191 SVN projects
> > (http://svn.apache.org/repos/asf/). We have successfully audited close to
> > 175 projects and hopefully by the end of this week all the remaining
> > projects should be covered. We will update the data once done.
> >
> > Large repositories like "subversion" and "camel", with 493,420 files
> > (approx. 9,723 MB) and 519,584 files (approx. 1,922 MB) respectively,
> > take only up to 36 hours to complete, which is quite a good number.
> >
> > For your second question, I don't have an answer yet. Our intention is
> > to update this regularly, but we have a limitation at the Wrangler end:
> > it doesn't allow us to run a job for more than 48 hours. Therefore,
> > for very large repositories like openoffice, spamassassin, myfaces, etc.,
> > which take more time to audit, it will be a challenge to split the
> > repositories every time and scan.
> >
> > Best Regards,
> > Karanjeet Singh
> > CS Graduate Student
> > University of Southern California
> > karan...@usc.edu | +1-213-675-9583
> >
> >
> > On Wed, Feb 3, 2016 at 12:06 AM, Pierre Smits <pierre.sm...@gmail.com>
> > wrote:
> >
> >> Hi Karanjeet,
> >>
> >> This is surely an impressive piece of work. But I still notice that some
> >> projects are missing in the overview. Is this a mere PoC not intended to be
> >> complete? Or something that will be made available to all and be updated
> >> regularly?
> >>
> >> Best regards,
> >>
> >> Pierre Smits
> >>
> >> ORRTIZ.COM <http://www.orrtiz.com>
> >> OFBiz based solutions & services
> >>
> >> OFBiz Extensions Marketplace
> >> http://oem.ofbizci.net/oci-2/
> >>
> >> On Wed, Feb 3, 2016 at 2:39 AM, Lewis John Mcgibbney <
> >> lewis.mcgibb...@gmail.com> wrote:
> >>
> >>> Hi Karanjeet,
> >>>
> >>> A good bunch of work has already gone into this and it is looking really
> >>> friggin smart indeed.
> >>> Interesting to see so many pieces of software come together and result in
> >>> something very easy to interpret.
> >>> Good work.
> >>> Lewis
> >>>
> >>> On Mon, Feb 1, 2016 at 11:44 PM, <dev-digest-h...@community.apache.org>
> >>> wrote:
> >>>
> >>>> Hello Everyone,
> >>>>
> >>>> With great pleasure, I would like to introduce DRAT (Distributed Release
> >>>> Audit Tool) which is a distributed, parallelized wrapper around Apache RAT
> >>>> to inspect for appropriate open source licensing in software projects.
> >>>> DRAT was started by my advisor, Chris Mattmann, in an effort to get RAT
> >>>> working on a very large code base. DRAT uses Apache OODT, Apache Tika, and
> >>>> Apache Solr.
> >>>>
> >>>> We are now auditing the complete Apache SVN code base to check for proper
> >>>> licenses. So far, we have scanned 171 / 191 repositories and
> >>>> illustrated the statistics for 133 of them through a D3 visualization
> >>>> located at http://drat.dyndns.org:8080/dratviz
> >>>>
> >>>> Projects should check out the MIME analysis of the code base and click
> >>>> around. Please also note that, due to the sheer size of the Apache code bases
> >>>> and the fact that we scanned and included all revisions in the Apache SVN
> >>>> repo, DRAT is not running in real time. We are running DRAT on the NSF
> >>>> supercomputer Wrangler, which has a petabyte of flash storage and the
> >>>> ability to stand up Hadoop and Spark clusters. We are also working on a
> >>>> paper describing our results.
> >>>>
> >>>> Please send feedback to me (Karanjeet Singh <karan...@usc.edu>),
> >>>> Professor Mattmann <mattm...@usc.edu> and/or ird...@mymaillists.usc.edu.
> >>>>
> >>>> Thanks & Regards,
> >>>> Karanjeet Singh
> >>>> C.S. Graduate Student
> >>>> University of Southern California
> >>>> karan...@usc.edu | +1-213-675-9583
> >>>
> >>
> >