Re: [Bioc-devel] A handful of check to follow up on R CMD BiocCheck

Martin Morgan Wed, 09 Nov 2016 15:33:46 -0800

On 11/03/2016 08:14 AM, Kevin RUE wrote:

Apologies for the additional spam, for two reasons:


  * The diff files that I've previously sent had the base and modified
    versions swapped. This new one fixes that.
  * This new diff file (always relative to the code I cloned from
    Bioconductor-mirror) also fixes a bug whereby the updated code would
    fail on packages that do not break any of the three guidelines.


Hi Kevin -- thanks for these, I incorporated a version in 1.11.1.

The report is for the first 6 lines of offenders; the code for finding
all offenders is mentioned in the vignette:


  BiocCheck:::checkFormatting("path/to/YourPackage", nlines=Inf)

Marcel Ramos mentions the lintr (CRAN) and goodpractice (github only,
https://github.com/MangoTheCat/goodpractice) packages as useful
resources for these types of questions; also formatR (CRAN).

For the record, it was fun to work through your code. The basic pattern
was to read lines and flag those that violated a (usually simple) test,
add these to a data frame, then if after parsing all files the data
frame had any rows report these to the user.
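
A minimal sketch of that read / flag / collect / report pattern for a
single file. This is an illustration only, not the code in BiocCheck:
the function name flagLines(), the column names, and the exact tests
are assumptions made for the example.

```r
# Hypothetical helper (not BiocCheck's implementation): flag lines of one file.
flagLines <- function(file, maxWidth = 80L) {
    lines <- readLines(file, warn = FALSE)

    # Three simple vectorized tests, one logical value per line
    tooLong <- nchar(lines) > maxWidth
    hasTab <- grepl("^ *\t", lines)            # TAB in the leading whitespace
    indent <- attr(regexpr("^ *", lines), "match.length")
    badIndent <- !hasTab & nzchar(trimws(lines)) & indent %% 4L != 0L

    # One block of data-frame rows per test; zero rows if nothing is flagged
    flagged <- function(idx, issue) {
        data.frame(
            file = rep(basename(file), sum(idx)),
            line = which(idx),
            issue = rep(issue, sum(idx)),
            stringsAsFactors = FALSE
        )
    }
    rbind(
        flagged(tooLong, "longer than 80 characters"),
        flagged(hasTab, "TAB used for indentation"),
        flagged(badIndent, "indent not a multiple of 4 spaces")
    )
}

# A small file that violates each guideline once
demo <- tempfile(fileext = ".R")
writeLines(c(
    "f <- function(x) {",
    "\tx + 1",
    "  x + 2",
    paste0("    y <- \"", strrep("a", 80), "\""),
    "}"
), demo)

report <- flagLines(demo)
if (nrow(report) > 0L)
    print(head(report, 6L))    # report only when something was flagged
```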

I standardized the columns of the data frame across checks to facilitate
re-use, and imposed the standard by using a constructor -- the
non-exported Context(). I took a little bit of pain to make the
constructor work (return a data frame with the correct columns but no
rows) when there are no lines of code flagged to be added to the
context, so that I could use it without having to insert a bunch of
conditional `if (...)` statements in my code that make it both harder to
understand and harder to test. Also, the constructor is vectorized,
whereas your implementation was iterative.
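
To illustrate the idea of a constructor that is both vectorized and safe
when nothing is flagged, here is a rough sketch. The name makeContext(),
its arguments, and its column names are assumptions for this example;
the real non-exported Context() may look different.

```r
# Hypothetical stand-in for a Context()-style constructor.
# 'idx' is a logical vector marking the flagged elements of 'lines'.
makeContext <- function(file = character(), lines = character(),
                        idx = logical()) {
    data.frame(
        File = rep(basename(file), sum(idx)),  # one entry per flagged line
        Line = which(idx),                     # line numbers, all at once
        Context = lines[idx],                  # content of the flagged lines
        stringsAsFactors = FALSE
    )
}

# No arguments, or no flagged lines: same columns, zero rows
empty <- makeContext()
clean <- makeContext("R/a.R", c("x <- 1", "y <- 2"), c(FALSE, FALSE))

# Flagged lines: one row each, with no loop over lines
ctxt <- makeContext("R/a.R", c("ok", "bad\t", "bad2\t"), c(FALSE, TRUE, TRUE))

# Because the empty case has the right columns, callers can rbind()
# unconditionally instead of guarding with if (...)
combined <- rbind(empty, clean, ctxt)
```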

For reporting to the user, I wrote a small helper handleContext() that I
could re-use across the different types of checks (and likely elsewhere
in the package) and also maintain the separation of what is reported to
the user (e.g., in handleContext()) from how it is reported to the user
-- .msg, .verbatim, etc. in the utils.R file. Probably the reporting
could have been refactored further.
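
The separation of "what" from "how" could be sketched roughly as
follows. Both function names and the 'emit' argument are hypothetical;
this is not the actual handleContext(), .msg or .verbatim code.

```r
# "What" is reported: the first 'nlines' offenders as file:line strings.
summarizeContext <- function(ctxt, nlines = 6L) {
    shown <- ctxt[seq_len(min(nlines, nrow(ctxt))), , drop = FALSE]
    sprintf("%s:%d", shown$File, shown$Line)
}

# "How" it is reported: delegated to 'emit', which defaults to message()
# but could be any function taking a single string.
reportContext <- function(ctxt, nlines = 6L, emit = message) {
    if (nrow(ctxt) == 0L)
        return(invisible(character()))
    txt <- summarizeContext(ctxt, nlines)
    emit(sprintf("First %d of %d flagged lines:", length(txt), nrow(ctxt)))
    for (x in txt)
        emit(paste0("  ", x))
    invisible(txt)
}

ctxt <- data.frame(
    File = c("R/a.R", "R/a.R"), Line = c(3L, 10L),
    stringsAsFactors = FALSE
)
shown <- reportContext(ctxt)
```

Swapping 'emit' for a function that collects its input makes the
reporting step easy to check without capturing console output.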

I feel guilty about not writing unit tests; the package does include
tests, including a handy package-generator function to make packages
that are invalid in various ways. It seems like a better way to go would
be to further refactor the body of the checkFormatting() function so
that one could for instance invoke a function checkLineLengths(pkg,
file, lines) and get back a data.frame created by Context(); it would
then be easy to test these for various 'lines' without having to figure
out how to make test packages. But this seemed of course like too much
additional work for this feature (I'll probably regret not expending the
effort at a future date).
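
As a rough idea of what such a refactoring might look like, here is a
sketch of a checkLineLengths(pkg, file, lines) that works on in-memory
lines and can be tested without building a package. No such function
exists in the package as described above; the body, the 'maxWidth'
argument and the column names are assumptions.

```r
# Hypothetical refactoring target: a pure function of its arguments.
checkLineLengths <- function(pkg, file, lines, maxWidth = 80L) {
    idx <- nchar(lines) > maxWidth
    data.frame(
        File = rep(file.path(basename(pkg), file), sum(idx)),
        Line = which(idx),
        Context = lines[idx],
        stringsAsFactors = FALSE
    )
}

# Tests only need character vectors, not test packages
ok <- checkLineLengths("path/to/YourPackage", "R/checks.R",
                       c("x <- 1", "y <- 2"))
stopifnot(nrow(ok) == 0L)

bad <- checkLineLengths("path/to/YourPackage", "R/checks.R",
                        c("x <- 1", strrep("a", 81)))
stopifnot(nrow(bad) == 1L, bad$Line == 2L)

none <- checkLineLengths("path/to/YourPackage", "R/checks.R", character())
stopifnot(nrow(none) == 0L)
```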


I also found myself using the horrible 'copy-and-append' paradigm

   ctxt = Context()
   for (fl in files) {
       ...
       ctxt = rbind(ctxt, Context(...))
   }

which makes a copy of (the increasingly large) ctxt each time through
the loop and therefore scales quadratically (when each file has some
problems) with the number of files. I'm counting on the number of files
to be small (e.g., less than 10000, when the copy-and-append pattern
starts to be expensive even for trivial operations; mzR has ~8750 files,
though many of these are not checked for formatting) so that this
pattern does not become a bottleneck. It was still painful to use...
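
For reference, one common way in R to avoid copy-and-append is to
collect the per-file results in a pre-allocated list and bind them once
at the end. This is a sketch of that alternative, not what the package
does; makeContext() is a hypothetical stand-in for the constructor and
the flagged line numbers are made up.

```r
# Hypothetical constructor: zero rows when 'idx' is empty
makeContext <- function(file = character(), idx = integer()) {
    data.frame(
        File = rep(file, length(idx)),
        Line = idx,
        stringsAsFactors = FALSE
    )
}

files <- sprintf("R/file%02d.R", 1:5)

# Pre-allocate one slot per file; each iteration fills its own slot,
# so the growing result is never copied inside the loop
parts <- vector("list", length(files))
for (i in seq_along(files)) {
    parts[[i]] <- makeContext(files[[i]], c(1L, 2L))  # pretend lines 1-2 flagged
}

# A single rbind() at the end; the leading empty context keeps the
# result a data frame even when 'files' is empty
ctxt <- do.call(rbind, c(list(makeContext()), parts))
```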


Martin


Best,
Kevin


On Thu, Nov 3, 2016 at 11:49 AM, Kevin RUE <kevinru...@gmail.com
<mailto:kevinru...@gmail.com>> wrote:

    Hi all,

    Please find attached the diff relative to the code that I cloned
    from Bioconductor-mirror yesterday (please ignore the previous diff
    file).

    Basically three new features:

      * As per previous email: display up to the first 6 lines that are
        over 80 characters long
      * *New*: display up to the first 6 lines that are not indented by
        a multiple of 4 spaces
      * *New*: display up to the first 6 lines that use TAB instead of 4
        spaces for indentation

    I also attach the output of the updated code
    <https://github.com/kevinrue/BiocCheck>, to illustrate the changes.

    Notes:

      * For demonstration purpose, I indented a handful of lines from
        the checks.R file itself with TAB characters. I assume that's
        OK, as some lines were already longer than 80 characters and not
        indented by a multiple of 4 spaces.

    All the best,
    Kevin


    On Wed, Nov 2, 2016 at 10:00 PM, Kevin RUE <kevinru...@gmail.com
    <mailto:kevinru...@gmail.com>> wrote:

        Me again :)

        Please find attached the first patch to print the first 6 lines
        over 80 characters long. (I'll get to the tabulation offenders
        next).

        Note that all the offending lines are stored in the "df.length"
        data.frame. How about an option like "fullReport=c(FALSE, TRUE)"
        that prints *all* the offending lines?
        The data.frame also stores the content of the lines for the
        record, but does not print them. I think Kasper is right:
        filename and line should be enough to track down the line.

        All the best,
        Kevin



        On Wed, Nov 2, 2016 at 8:08 PM, Kevin RUE <kevinru...@gmail.com
        <mailto:kevinru...@gmail.com>> wrote:

            Thanks for the feedback!

            I also tend to prefer *all* the lines being reported (or to
            be honest, that was really true when I had lots of them; a
            problem that I largely mitigated by fixing all of them once
            and subsequently paying more attention while developing).

            Printing the content of the offending line somewhat helps me
            spot the line faster (more so for tab issues). But I must
            admit that showing the whole line is somewhat "overkill". I
            just started thinking of a compromise being to only show the
            first N characters of the line, with N being 80 minus the
            number of characters necessary to print the filename and
            line number.

            Thanks Martin for pointing out the lines in BiocCheck. (Now
            I feel bad for not having checked sooner.. hehe!)
            I think the idea of BiocCheck showing the first 6 offenders
            is quite nice, as I rarely have more since I use the
            RStudio "Tools > Global Options > Code > Display > Show
            Margin > Margin column: 80" feature.

            I'll give a go at both approaches (developing BiocCheck and
            my own scripts)

            Cheers,
            Kevin


            On Wed, Nov 2, 2016 at 7:41 PM, Kasper Daniel Hansen
            <kasperdanielhan...@gmail.com
            <mailto:kasperdanielhan...@gmail.com>> wrote:

                I would prefer all line numbers reported, but on the
                other hand I am indifferent wrt. the content of the
                line, unless (say) TABs are marked up somehow.

                Kasper

                On Wed, Nov 2, 2016 at 3:17 PM, Martin Morgan
                <martin.mor...@roswellpark.org
                <mailto:martin.mor...@roswellpark.org>> wrote:

                    On 11/02/2016 02:49 PM, Kevin RUE wrote:

                        Dear all,

                        Just thought I'd share a handful of scripts that
                        I wrote to follow up on
                        certain NOTE messages thrown by R CMD BiocCheck.

                        https://github.com/kevinrue/BiocCheckTools
                        <https://github.com/kevinrue/BiocCheckTools>

                        They're very simple, but I occasionally find
                        them quite convenient.
                        Apologies if something similar already exists
                        somewhere :)


                    Maybe consider creating a diff against the source
                    code that, e.g., reported the first 6 offenders? The
                    relevant lines are near

                    
                    https://github.com/Bioconductor-mirror/BiocCheck/blob/master/R/checks.R#L1081
                    <https://github.com/Bioconductor-mirror/BiocCheck/blob/master/R/checks.R#L1081>

                    Martin


                        All the best,
                        Kevin

                                [[alternative HTML version deleted]]

                        _______________________________________________
                        Bioc-devel@r-project.org
                        <mailto:Bioc-devel@r-project.org> mailing list
                        https://stat.ethz.ch/mailman/listinfo/bioc-devel
                        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>



                    This email message may contain legally privileged
                    and/or...{{dropped:2}}





