Thanks Don and Tony. Yes, we used the http://svn-dump.apache.org/ link to download the SVN dump, and we are running DRAT on it.
The other link was just for reference. I hope I am safe from the aug-banning service. :)

Best Regards,
Karanjeet Singh
C.S. Graduate Student
University of Southern California
karan...@usc.edu | +1-213-675-9583

On Wed, Feb 3, 2016 at 1:07 AM, Don Cunningham <otto...@gmail.com> wrote:

> On Feb 3, 2016 4:06 AM, "Tony Stevenson" <t...@pc-tony.com> wrote:
>
>> cc += infra@
>>
>> Karanjeet,
>>
>> I am writing to you whilst wearing my Infrastructure hat.
>>
>> Please be careful if you are indeed recursing the entire ASF subversion
>> repository (http://svn.apache.org), as you will quite likely run into
>> the aug-banning service. Have you seen https://svn-dump.apache.org?
>> This is an entire dump of the SVN repo (at least the public one you are
>> interested in). You can use this, and it is updated monthly. If you
>> really need fully up-to-date data, you can use the dump and svnsync the
>> remaining revisions.
>>
>> I guess this might be obvious, but I’ll mention it just in case. A lot
>> of projects are using git repositories too.
>> Which are mirrored here: github.com/apache/
>>
>> --
>> Cheers,
>> Tony
>>
>> On behalf of the Apache Infrastructure Team
>>
>> -----------------------
>> http://www.pc-tony.com
>> GPG - 3072D/2543E323
>> -----------------------
>>
>> > On 3 Feb 2016, at 08:58, Karanjeet Singh <karan...@usc.edu> wrote:
>> >
>> > Thanks Pierre for your feedback.
>> >
>> > Yes, the visualization corresponds to only 133 / 191 SVN projects
>> > (http://svn.apache.org/repos/asf/). We have successfully audited close
>> > to 175 projects and hopefully by the end of this week all the remaining
>> > projects should be covered. We will update the data once done.
>> >
>> > Large repositories like "subversion" and "camel", with 493,420 files
>> > (approx. 9,723 MB) and 519,584 files (approx. 1,922 MB) respectively,
>> > take only up to 36 hours to complete, which is quite a good number.
>> >
>> > For your second question, I don't have an answer yet. Our intention is
>> > to update this regularly, but we have a limitation at the Wrangler end:
>> > it doesn't allow us to run a job for more than 48 hours.
>> > Therefore, for very large repositories like openoffice, spamassassin,
>> > myfaces, etc., which take more time to audit, it will be a challenge
>> > to split the repositories and scan them every time.
>> >
>> > Best Regards,
>> > Karanjeet Singh
>> > CS Graduate Student
>> > University of Southern California
>> > karan...@usc.edu | +1-213-675-9583
>> >
>> >
>> > On Wed, Feb 3, 2016 at 12:06 AM, Pierre Smits <pierre.sm...@gmail.com>
>> > wrote:
>> >
>> >> Hi Karanjeet,
>> >>
>> >> This is surely an impressive piece of work. But I still notice that
>> >> some projects are missing in the overview. Is this a mere PoC not
>> >> intended to be complete? Or something that will be made available to
>> >> all and be updated regularly?
>> >>
>> >> Best regards,
>> >>
>> >> Pierre Smits
>> >>
>> >> ORRTIZ.COM <http://www.orrtiz.com>
>> >> OFBiz based solutions & services
>> >>
>> >> OFBiz Extensions Marketplace
>> >> http://oem.ofbizci.net/oci-2/
>> >>
>> >> On Wed, Feb 3, 2016 at 2:39 AM, Lewis John Mcgibbney <
>> >> lewis.mcgibb...@gmail.com> wrote:
>> >>
>> >>> Hi Karanjeet,
>> >>>
>> >>> A good bunch of work has already gone into this and it is looking
>> >>> really friggin smart indeed.
>> >>> Interesting to see so many pieces of software come together and
>> >>> result in something very easy to interpret.
>> >>> Good work.
>> >>> Lewis
>> >>>
>> >>> On Mon, Feb 1, 2016 at 11:44 PM, <dev-digest-h...@community.apache.org>
>> >>> wrote:
>> >>>
>> >>>> Hello Everyone,
>> >>>>
>> >>>> With great pleasure, I would like to introduce DRAT (Distributed
>> >>>> Release Audit Tool), which is a distributed, parallelized wrapper
>> >>>> around Apache RAT to inspect for appropriate open source licensing in
>> >>>> software projects. DRAT was started by my advisor, Chris Mattmann, in
>> >>>> an effort to get RAT working on a very large code base. DRAT uses
>> >>>> Apache OODT, Apache Tika, and Apache Solr.
>> >>>>
>> >>>> We are now auditing the complete Apache SVN code base to check for
>> >>>> proper licenses. Until now, we have scanned 171 / 191 repositories and
>> >>>> illustrated the statistics for 133 of them through a D3 visualization
>> >>>> located at http://drat.dyndns.org:8080/dratviz
>> >>>>
>> >>>> Projects should check out the MIME analysis of the code base and click
>> >>>> around. Please also note that, due to the sheer size of the Apache code
>> >>>> bases and the fact that we scanned and included all revisions in the
>> >>>> Apache SVN repo, DRAT is not running in real time. We are running DRAT
>> >>>> on the NSF supercomputer Wrangler, which has a petabyte of flash
>> >>>> storage and the ability to stand up Hadoop and Spark clusters. We are
>> >>>> also working on a paper describing our results.
>> >>>>
>> >>>> Please send feedback to me (Karanjeet Singh <karan...@usc.edu>),
>> >>>> Professor Mattmann <mattm...@usc.edu>, and/or
>> >>>> ird...@mymaillists.usc.edu.
>> >>>>
>> >>>> Thanks & Regards,
>> >>>> Karanjeet Singh
>> >>>> C.S. Graduate Student
>> >>>> University of Southern California
>> >>>> karan...@usc.edu | +1-213-675-9583