DM, thanks for the script! It's exactly what I needed. Again, very much appreciated!
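
For the archives, here is a minimal sketch of the stack diagnostic DM describes in the quoted message below: an end tag that is found deeper in the stack versus one that is not on the stack at all. This is not part of DM's script. The name diagnose_end_tag is hypothetical, the stack entries are assumed to start with [ tag name, line number ] as in DM's checker, and the same simple rules DM lists are assumed to hold.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (an illustration, not DM's code): explain a
# mismatched end tag by searching the stack of open elements.
# Each stack entry is [ tag name, line number, ... ], innermost last.
sub diagnose_end_tag {
    my ($stack, $tagName, $lineNum) = @_;

    # Walk from the top of the stack (innermost element) downwards.
    for (my $i = $#{$stack}; $i >= 0; $i--) {
        next unless $stack->[$i][0] eq $tagName;

        # Matches the innermost open element: properly nested.
        return "" if $i == $#{$stack};

        # Found deeper in the stack: the elements above it are suspect.
        my @unclosed = map { sprintf("<%s> (line %d)", $_->[0], $_->[1]) }
                       reverse @{$stack}[ $i + 1 .. $#{$stack} ];
        return sprintf("line %d: </%s> matches <%s> from line %d, "
                     . "but still open above it: %s",
                       $lineNum, $tagName, $tagName, $stack->[$i][1],
                       join(", ", @unclosed));
    }

    # Not on the stack at all.
    return sprintf("line %d: </%s> is not on the stack: "
                 . "never started, or ended twice", $lineNum, $tagName);
}

# Example with a made-up stack: <div> opened first, <q> innermost.
my @open = ( [ "div", 10 ], [ "p", 12 ], [ "q", 14 ] );
print diagnose_end_tag(\@open, "p", 20), "\n";
print diagnose_end_tag(\@open, "verse", 21), "\n";
```

To use it with DM's checker it would need the stack as it was before the pop; that wiring is left out here.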
~A

On Fri, Sep 21, 2012 at 3:12 PM, DM Smith <dmsm...@crosswire.org> wrote:
> Here is a simple perl script that will check many xml files for errors
> (following the assumptions listed below). I think the diagnostics are
> relatively easy to understand.
>
> #!/usr/bin/perl
> # BSD License
>
> use strict;
>
> my $lineNum = 0;
> my $element = "";
> my $tagName = "";
> my @tagStack = ();
> lines: while (<>) {
>     $lineNum++;
>
>     # While there is a tag on the line
>     # remove and process it
>     while (s/<([^>]+)>//o) {
>         $element = $1;
>
>         # self closed tags are skipped
>         if ($element =~ /(.*\/|\?.*\?)$/) {
>             next;
>         }
>
>         # end tags have to nest properly
>         # thus match stack top
>         if ($element =~ /^\/([^\s]+).*$/o) {
>             $tagName = $1;
>             # an end tag with nothing open would otherwise
>             # dereference undef and die under "use strict"
>             if (!@tagStack) {
>                 print "Error on line $lineNum: end tag $tagName has no open element\n";
>                 last lines;
>             }
>             my ($topTagName, $topLineNum, $topElement) = @{ pop @tagStack };
>             if ($topTagName ne $tagName) {
>                 print "Error on line $lineNum: expected $tagName, but saw $topTagName from line $topLineNum (element: <$topElement>)\n";
>                 last lines;
>             }
>         } else {
>             # Found a start element
>             $element =~ /^([^\s]+).*$/o;
>             $tagName = $1;
>             push @tagStack, [ $tagName, $lineNum, $element ];
>         }
>     }
> }
>
> # anything still on the stack was never closed
> foreach my $location (reverse @tagStack) {
>     my ($topTagName, $topLineNum, $topElement) = @{$location};
>     print "unmatched $topTagName from line $topLineNum (element: <$topElement>)\n";
> }
>
> On Sep 21, 2012, at 1:27 PM, DM Smith <dmsm...@crosswire.org> wrote:
>
> So far the discussion is around whether the xml is well-formed.
> Once you get that working, then you need to make sure it is valid wrt the
> OSIS schema.
>
> There's an old tool that will convert sgml to well-formed xml. I think it
> was James Clark's "sx". I've used it successfully on initial conversions and
> getting something that will work within xml tools.
>
> Finally, OSIS has the notion of milestones for start and end elements. There
> are semantic rules regarding this that cannot be checked by standard xml
> tools. Osis2mod tries to handle this.
> When you get to that point, I can help unravel the logging options.
>
> The purpose of milestoned elements is to allow for two competing document
> models to be in the same xml document: BSP and BCV (names we've given them
> here and in the wiki).
>
> We recommend using BSP (book, chapter, section, paragraph, poetry and lists
> all as containers, not milestoned), with verse elements milestoned.
>
> Note, the OSIS manual says that if you have one element milestoned, then all
> other elements with the same tag name have to be milestoned. Practically
> speaking, this does not matter. SWORD and JSword don't care. Having verses
> milestoned only if necessary is probably a better way to create a good XML
> document. Start out with all of them as containers and, at each place where
> that causes a problem, either fix the xml or, if it is otherwise correct,
> convert to milestoned verses.
>
> Generally speaking these BSP elements should not start just inside or at the
> end of a verse. Rather they should be between verse elements or within the
> text. When they are placed just after the verse start, they often will cause
> the verse number to be orphaned. When they are placed just before the verse
> end, then it is generally not noticeable (just bad form).
>
> Quotes will create the biggest grief in the above. They often cross
> boundaries. Certainly, the beatitudes do, starting in one chapter and
> ending a couple of chapters later. For this reason, using the milestoned
> version is necessary.
>
> If your document follows some simple rules (some required by xml, others
> simplifications), then checking nesting is a simple matter of having a
> push/pop stack of elements. The simple rules:
> 1) All attributes when present have quoted values.
> 2) All entities are properly formed and used when needed. Also, < and > are
>    not in attribute values.
> 3) Tags are marked with < ... >, </ ... >, or < ... />, and there are no
>    new lines between < and >.
>
> If this is true then a simple perl script can be written to find the
> problems in the file:
> Look for < ... /> and skip them. They cause no problems.
> Look for < xxx ... > and push the tag name along with its location in the
> file on to the stack.
> Look for </ xxx > and compare xxx to the top element on the stack. If it
> doesn't match, then it is an error.
> When you get to the end of the document and the stack is not empty, then the
> elements on the stack are not closed properly.
>
> Printing out the stack (elements and locations) would help find what the
> problem is.
>
> For example:
> If xxx is deeper in the stack, then there is a problem with nesting. Look at
> all the elements above the xxx on the stack for problems.
> If it is not in the stack, then the element was not started prior to that
> point or it may have been ended twice.
>
> Here is a simple perl script (that I wrote), which doesn't do that, but
> could be adapted to do it. This creates a histogram/dictionary of tag and
> attribute names.
>
> #!/usr/bin/perl
>
> use strict;
>
> my %tags = ();
> my %attrs = ();
> while (<>)
> {
>     #print;
>     # While there is a tag on the line
>     while (/<[^\/\s>]+[\/\s>]/o)
>     {
>         # While there is an attribute in the tag
>         while (/<[^\/\s>]+\s+[^\=\/\>]+=\"[^\"]+\"/o)
>         {
>             # remove the attribute
>             s/<([^\/\s>]+)\s+([^\=\/\>]+)(\="[^\"]+\")(.*)/<$1 $4/o;
>             my ($t, $a, $v, $r) = ($1, $2, $3, $4);
>             $attrs{"$t.$a"}++;
>         }
>         # remove the tag
>         s/<([^\/\s>]+)[\/\s>]//o;
>         $tags{$1}++;
>         #print("do next tag on line\n");
>     }
>     #print("do next line\n");
> }
>
> foreach my $tag (sort keys %tags)
> {
>     print("$tag\n");
> }
>
> foreach my $attr (sort keys %attrs)
> {
>     print("$attr\n");
> }
>
> Hope this helps,
> DM
>
> On Sep 21, 2012, at 10:52 AM, Andrew Thule <thules...@gmail.com> wrote:
>
> Thanks everyone for suggestions. I'll give them all a try.
>
> That said, the emacs recommendation is nearly a religious conversion
> recommendation.
> (I'm on the vi side of the vi versus emacs debate. I suppose as long as it
> doesn't kill me I should give it a try, though I'm not certain what impact
> it will have on the health of my soul ... :D )
>
> ~A
>
>
> On Thursday, September 20, 2012, Daniel Owens wrote:
>>
>> I use jEdit with the XML plugin installed. I find it helps me find
>> problems fairly easily.
>>
>> Daniel
>>
>> On 09/20/2012 05:26 PM, Greg Hellings wrote:
>>>
>>> There are a number of pieces of software out there that will
>>> pretty-print the XML for you, with indenting and whatnot. Overly
>>> indented for what you would want in production but decent for
>>> debugging mismatching nesting and the like.
>>>
>>> For example, 'xmllint --format' will properly indent the file, etc. I
>>> don't know how it will handle poorly formed XML.
>>>
>>> GUI editors can do wonders as well. On Windows I use Notepad++ and
>>> manually set it to display XML. gEdit and Geany - I believe - both
>>> support similar display worlds. And there are some plugins for Eclipse
>>> that might handle what you need as well.
>>>
>>> --Greg
>>>
>>> On Thu, Sep 20, 2012 at 4:19 PM, Karl Kleinpaste <k...@kleinpaste.org> wrote:
>>>>
>>>> Andrew Thule <thules...@gmail.com> writes:
>>>>>
>>>>> One of my least favourite things is finding mismatched tags in
>>>>> OSIS.xml files. Has anyone successfully climbed this summit?
>>>>
>>>> XEmacs and xml-mode (and font-lock-mode). M-C-f and M-C-b execute
>>>> sgml-forward-element and -backward-. That is, sitting at the beginning
>>>> of <tag>, M-C-f (meta-control-f) moves forward to the matching </tag>,
>>>> properly handling nested tags.
>>>>
>>>> _______________________________________________
>>>> sword-devel mailing list: sword-devel@crosswire.org
>>>> http://www.crosswire.org/mailman/listinfo/sword-devel
>>>> Instructions to unsubscribe/change your settings at above page