Hello,

Shweta, I have now fixed your package in “master” and “RELEASE_3_8” branches. 
You should sync your package on GitHub either by taking a fresh clone from 
git.bioconductor.org 
(http://bioconductor.org/developers/how-to/git/maintain-github-bioc/) or 
syncing it 
(http://bioconductor.org/developers/how-to/git/sync-existing-repositories/). 
Please make sure there are no duplicate commits before you push from a local 
repository to the server. 

Please also check that the package works as expected. Some commits may have 
been omitted, so make sure the code and all the files look exactly as they are 
supposed to. 

Thank you. 

Nitesh,




P.S.: This postscript is going to be somewhat long, for the benefit of the 
other maintainers in the community. 

Just so other maintainers are informed, this was probably my most challenging 
“de-duplication” and “large-file” removal yet. One of the main concerns was the 
lack of informative commit messages in the package. 

NOTE: Lots of assumptions were made in fixing this package, as I’m not the 
maintainer. Every maintainer should take care after such a process to make 
sure the package works as expected, and debug any issues.

I found the largest files in the restfulSE package using,

        >> du -ha 

This shows that the .pack file is the largest (257MB). The pack file contains 
the packed (compressed) objects for previous commits.
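
To rank the entries rather than scan the full listing, the du output can be 
sorted numerically. This is a minimal sketch, assuming a POSIX shell; the 
function name largest_entries is hypothetical and was not part of the 
original workflow.

```shell
# List the N largest files and directories under a path (sizes in KB).
# Hypothetical helper, not part of the original workflow.
largest_entries() {
    dir=${1:-.}
    count=${2:-10}
    # -a lists files as well as directories, -k reports 1024-byte units
    du -ak "$dir" | sort -n | tail -n "$count"
}
```

For example, "largest_entries . 5" prints the five largest entries, with the 
biggest one last.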

        >> git verify-pack -v ./.git/objects/pack/pack-39dcca4030fa3e3901bb5874afa855110956cb4b.pack | sort -k 3 -n | tail -10

        d502afb0e1e3307f0185bf2463af330c0db01d45 blob   13144692 13148712 497110
        35452f5c77520daf966f0228e1a102b71f3302b1 blob   17148372 17018509 121401976
        b501ea4f451e61790377b06da6f7072f6532b2c9 blob   17849118 17730695 160967245
        1850dbfeb056878662e8b383fb2fbf0aa27f5ebd blob   25657619 22542379 138421662
        a495c635ebe44020dd563c0cd4265a33dd76c8ec blob   26339453 22683993 75191432
        7baad7ed1d83dbf2eeb8a97f4c542035a06adb43 blob   27136447 23505426 51685564
        6c28ec53a7765775540422258716467143182fb8 blob   27153019 23522387 97877531
        a34d64c68a7b02519905fa547674764351df4310 blob   27374529 23735207 13645822
        64d12c30c8f5805c896d010929ea7c77e7e39279 blob   38726127 38541293 230237839
        8bda5612a454c33e50efe4f87d9a7442e239ab23 blob   40075136 39119863 187385432

The first column holds the IDs of the largest blobs in the history (these are 
blob IDs, not commit IDs). I saved those IDs to a new file called “shas”. 

        >> while read sha; do
                git rev-list --objects --all | grep "$sha"
           done < ../shas

        d502afb0e1e3307f0185bf2463af330c0db01d45 data/banoSEMeta.rda
        35452f5c77520daf966f0228e1a102b71f3302b1 data/stfull_rse.rda
        b501ea4f451e61790377b06da6f7072f6532b2c9 data/banoSEMeta.rda
        1850dbfeb056878662e8b383fb2fbf0aa27f5ebd data/stfull.rda
        a495c635ebe44020dd563c0cd4265a33dd76c8ec data/full_1Mneurons.rda
        7baad7ed1d83dbf2eeb8a97f4c542035a06adb43 data/full_1Mneurons.rda
        6c28ec53a7765775540422258716467143182fb8 data/sefull.rda
        a34d64c68a7b02519905fa547674764351df4310 data/full_1Mneurons.rda
        64d12c30c8f5805c896d010929ea7c77e7e39279 data/tenx_100k_sorted.rda
        8bda5612a454c33e50efe4f87d9a7442e239ab23 data/tenx_100k_sorted.h5
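
The loop above runs git rev-list once per ID. The same lookup can be done in a 
single pass over the object listing. This is a minimal sketch, assuming one 
full ID per line in the IDs file; the function name paths_for_ids is 
hypothetical.

```shell
# Print "<object id> <path>" for every listed object whose ID appears in the
# IDs file. The object listing (as printed by `git rev-list --objects --all`)
# is read from stdin. Hypothetical helper name.
paths_for_ids() {
    ids_file=$1
    # While reading the IDs file, remember each ID; then print the stdin
    # lines whose first field is one of the remembered IDs.
    awk 'NR == FNR { want[$1] = 1; next } ($1 in want)' "$ids_file" -
}
```

Usage would be "git rev-list --objects --all | paths_for_ids ../shas". The 
output follows the order of the object listing, not the order of the IDs file.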

This gave me the names of the files which were causing the issue (all but one 
of them .rda files).

I used BFG cleaner to clean those files, 

        >> java -jar ~/Downloads/bfg-1.13.0.jar --strip-blobs-bigger-than 5M restfulSE

        >> git reflog expire --expire=now --all && git gc --prune=now --aggressive
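
Before a cleaning run like this, the "git verify-pack -v" listing can be 
filtered to preview which blobs are above the size limit. This is a minimal 
sketch, assuming that "5M" corresponds to 5 * 1024 * 1024 bytes (an assumption 
about how BFG reads the suffix); the third column of the listing is the object 
size in bytes. The function name blobs_over is hypothetical.

```shell
# Read `git verify-pack -v` output on stdin and print "<id> <size>" for every
# blob larger than the given number of bytes. Hypothetical helper name.
blobs_over() {
    limit_bytes=$1
    awk -v limit="$limit_bytes" '$2 == "blob" && $3 > limit { print $1, $3 }'
}
```

Usage would be "git verify-pack -v <pack file> | blobs_over $((5 * 1024 * 1024))".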


Next, I wanted to clean the commit history because there are many duplicates. 
If you look at the commit history, it’s important to notice that the commits 
up to the ID `40a0cf0` are duplicated. The commit before this is `f6e30f8`, 
which is not duplicated. The goal is to reset to this unduplicated commit, and 
replay the successive unduplicated commits on top of it. 
(https://github.com/shwetagopaul92/restfulSE/commits/master)
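
One way to confirm that a history contains duplicated commits is to look for 
log entries that share author, author date and subject. This is a minimal 
sketch, assuming the duplicates kept their author metadata (an assumption, not 
something verified for this package); the function name duplicated_entries is 
hypothetical.

```shell
# Read one log entry per line on stdin and print each entry that occurs more
# than once. Hypothetical helper name.
duplicated_entries() {
    sort | uniq -d
}
```

Usage would be "git log --format='%an|%ad|%s' | duplicated_entries". Empty 
output means no two commits share all three fields.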


Next, I did the following to prepare the series of commits to overlay, 

        >> git log --oneline > commits.txt 

I manually deduplicated the “commits.txt” file (i.e., literally just went 
through and deleted every alternate line). Then I extracted only the commit 
IDs, since we don’t need the commit messages. We have to overlay these commits 
in reverse order (oldest first); this is important to remember.

        >> cat commits.txt | awk -F" " '{print $1}' > commits

        >> tail -r commits > commits_reversed ## reverse the order of the commits
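
The three steps above (drop every alternate line, keep the ID column, reverse 
the order) can also be sketched as one awk pass. This avoids "tail -r", which 
is not available in GNU coreutils. It is a minimal sketch, assuming every 
commit appears as two adjacent lines and that the first line of each pair is 
the one to keep (which of the two was kept originally is not stated); the 
function name unique_ids_oldest_first is hypothetical.

```shell
# Read `git log --oneline` text on stdin, keep lines 1, 3, 5, ... and print
# their commit IDs in reverse order (oldest first). Hypothetical helper name.
unique_ids_oldest_first() {
    awk 'NR % 2 == 1 { ids[++n] = $1 }
         END { for (i = n; i >= 1; i--) print ids[i] }'
}
```

Usage would be "git log --oneline | unique_ids_oldest_first > commits_reversed". 
The result still needs a manual check wherever the duplicates do not alternate 
cleanly.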

I reset to the commit before the duplicates,

        ~/D/restfulSE ❯❯❯ git reset --hard f6e30f8
        HEAD is now at f6e30f8 drop .swp file


Now we have to overlay these commits onto the unduplicated commit. NOTE: I’m 
oversimplifying the process here; there were a few “cherry-pick” conflicts 
which needed to be sorted out. Once a conflict was resolved, I’d have to curate 
the “commits_back” file manually, based on the commit at which the “while 
loop” broke.

        ~/D/restfulSE ❯❯❯ while read commit; do
                              git cherry-pick $commit
                          done < ../commits_back
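
Because the loop above keeps going after a failed cherry-pick, it helps to stop 
at the first conflict and report where it happened. This is a minimal sketch 
of that idea, not the exact loop used here; the function name 
pick_until_conflict is hypothetical.

```shell
# Cherry-pick the commit IDs listed in a file (one per line, oldest first).
# Stop at the first failure and report the line number and ID on stderr.
# Hypothetical helper name.
pick_until_conflict() {
    list=$1
    line=0
    while read -r commit; do
        line=$((line + 1))
        # stdin is redirected so git cannot consume the rest of the list
        if ! git cherry-pick "$commit" < /dev/null; then
            echo "cherry-pick stopped at line $line: $commit" >&2
            return 1
        fi
    done < "$list"
}
```

After resolving the conflict and running "git cherry-pick --continue", the 
remaining IDs can be written to a new list, for example with 
"tail -n +4 ../commits_back > rest" if the loop stopped at line 3.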


This process puts us at the beginning of RELEASE_3_7. More commits had been 
made since then, and they needed to be cherry-picked from RELEASE_3_8, so I 
repeated the process to overlay them. Before that, I added a remote so that I 
could cherry-pick the latest RELEASE_3_8 commits from a branch; /tmp/restfulSE 
is the location of the package’s most recent version (with all the duplicated 
commits from RELEASE_3_8 / master).

        >> git remote add upstream /tmp/restfulSE

        >> git fetch --all

        >> git log --oneline upstream/master > ../more_commits

        ## Manually deduplicate them,
        >> vim more_commits

        ## reverse the order and get just the first column,
        >> cat more_commits | awk -F" " '{print $1}' | tail -r > ~/Documents/more_commits_back


I overlaid all the new commits while battling conflicts (there were plenty of 
them in this step).

        >> while read commit; do
                git cherry-pick $commit
           done < ../more_commits_back


Once the commits were overlaid, I had to do the release process (RELEASE_3_8) 
again for restfulSE,

        >> git reset --soft 6899c3f 

        >> git branch RELEASE_3_8

        >> git add DESCRIPTION

        >> git commit -m "bump x.y.z versions to odd y after creation of RELEASE_3_8 branch"
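
Since the commit message above says master should carry an odd y after the 
release branch is created, a quick check of the Version field in DESCRIPTION 
can catch a missed bump. This is a minimal sketch, assuming a 
"Version: x.y.z" line with no leading whitespace; the function names 
version_minor and is_devel_version are hypothetical.

```shell
# Print the y component of the "Version: x.y.z" field of a DESCRIPTION file.
# Hypothetical helper name.
version_minor() {
    awk -F'[ .]+' '$1 == "Version:" { print $3; exit }' "$1"
}

# Succeed when y is odd, i.e. the version looks like a devel (master) version.
# Hypothetical helper name.
is_devel_version() {
    y=$(version_minor "$1")
    [ -n "$y" ] && [ $((y % 2)) -eq 1 ]
}
```

For example, "is_devel_version DESCRIPTION && echo odd" on master, and the 
opposite result is expected on the RELEASE_3_8 branch.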


After adding the commits from RELEASE_3_8, I again removed any large “.rda” 
files using BFG cleaner,

        >> java -jar ~/Downloads/bfg-1.13.0.jar --strip-blobs-bigger-than 5M restfulSE

        >> git reflog expire --expire=now --all && git gc --prune=now --aggressive


Finally, I added the appropriate remotes for git.bioconductor.org and force pushed.
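
The force push can be sketched as a small loop over the rewritten branches. 
This is a minimal sketch only: the function name push_clean_history and the 
remote name "bioc" used below are hypothetical, and the remote is assumed to 
have been added beforehand with "git remote add" using the URL from the 
Bioconductor instructions. A force push overwrites the history on the server, 
so it should only be run once the local branches have been checked.

```shell
# Force-push each named branch to the given remote, stopping at the first
# failure. Hypothetical helper name.
push_clean_history() {
    remote=$1
    shift
    for branch in "$@"; do
        git push --force "$remote" "$branch" || return 1
    done
}
```

Usage would be "push_clean_history bioc master RELEASE_3_8".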

This now gives the package restfulSE a clean commit history in both the 
`master` and the `RELEASE_3_8` branches. This, of course, does not guarantee 
that the package works as intended, and the maintainer should take all 
precautions to fix it from this point forward. 

And as a reminder to authors, please follow the instructions on 
http://bioconductor.org/developers/how-to/git/. If you have any questions, 
please ask on the bioc-devel mailing list. It is much easier to answer a 
question beforehand than to manually fix a repository afterwards. 

The main takeaway from this is that it is extremely tedious to fix commit 
histories. If you have any questions about how I did this, please ask: I did 
the best I could to document the process, and I saved copies of the original 
repositories with all their issues. 

Best,

Nitesh 


P.S: Just kidding, there can’t be more than one :D (Follow best practices. )

> On Oct 31, 2018, at 3:38 PM, Turaga, Nitesh <nitesh.tur...@roswellpark.org> 
> wrote:
> 
> Hi Shweta,
> 
> Please hold off making anymore commits to restfulSE. I’ve noticed some 
> discrepancies in your package pre-release and post-release. I’ll try to 
> correct it the best I can before letting you know. 
> 
> The issues seem to be two-fold, duplicate commits and unusually large file 
> package. Your BFG cleaning as we spoke off line succeeded in making the 
> package size smaller, but it seems to have induced more issues as far as 
> contamination of commit history goes. 
> 
> I’ll work on this in the next day or so, and let you know. 
> 
> You can then sync a fresh copy of the package on the Bioconductor git server.
> 
> Best,
> 
> Nitesh 
> 
> 
>> On Oct 30, 2018, at 10:32 AM, Shweta Gopaulakrishnan 
>> <re...@channing.harvard.edu> wrote:
>> 
>> Hi Nitesh, 
>> 
>> Hope you are doing good ! I am working with bfg to reduce the size of 
>> "restfulSE" package. I am able to do so with the github repository but not 
>> with the bioconductor repository. 
>> 
>> The steps I am following for the bioconductor one is :
>> 
>>> git clone --mirror g...@git.bioconductor.org:packages/restfulSE
>>> java -jar bfg.jar --strip-blobs-bigger-than 30M restfulSE.git
>>> cd restfulSE.git
>>> git reflog expire --expire=now --all && git gc --prune=now --aggressive
>>> git push 
>> 
>> I get an error : fatal: This operation must be run in a work tree 
>> 
>> Is there any other way to push changes upstream after bfg ? 
>> 
>> Thank you! 
>> -- 
>> Shweta Gopaulakrishnan,
>> Bioinformatician,
>> Channing Division of Network Medicine,
>> Brigham and Women's Health Hospital,
>> Boston,MA 02115
>> 
> 



_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
