Re: [sword-devel] Script to find a best fit v11n

DM Smith Thu, 19 Jun 2025 14:22:09 -0700

> On Jun 19, 2025, at 3:24 PM, Greg Hellings <greg.helli...@gmail.com> wrote:
> 
> 
> 
> On Thu, Jun 19, 2025 at 9:07 AM DM Smith <dmsm...@crosswire.org 
> <mailto:dmsm...@crosswire.org>> wrote:
>> Greg,
>> There’s an extraneous %s in the output.
> 
> Ah, not surprising. That is the old, Python 2 way of formatting variables 
> into a string, similar to C style printf syntax with variable arguments 
> coming in a tuple after an overload of the modulus operator (so it would look 
> like `"this is a string: %s" % (a_string, )` ). The modern preferred way is 
> with an f-string, where you preface a string with the character `f` and then 
> reference variables in the string with {variable_name} syntax (e.g. `f"this 
> is a string: {a_string}"`). That %s can be killed off, or replaced with an 
> f-string equivalent.
>  
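For anyone following along without a Python background, here is a minimal sketch of the two formatting styles described above, plus one plausible way a literal %s could leak into output. The variable names and sample IDs are made up for illustration and are not taken from av11n.py.

```python
# Hypothetical sample data; these names are illustrative, not from av11n.py.
missing_ids = ["Gen.1.1", "Gen.1.3"]
joined = ", ".join(missing_ids)

# printf-style formatting with the % operator and a tuple of arguments.
old_style = "Missing: %s" % (joined,)

# f-string: the variable is referenced directly inside the braces.
new_style = f"Missing: {joined}"

# If the % operator is never applied, the placeholder is printed literally.
stray = "%s " + joined

print(old_style)
print(new_style)
print(stray)
```

Both styles remain valid in Python 3; the f-string form needs Python 3.6 or later.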
>> 
>> If you put the enumeration after the line "There are 93 OT IDs and 5 NT IDs 
>> in v11n which aren’t in your file." then you wouldn’t need the heading "The 
>> following IDs don’t appear in your file:"
> 
> Yeah, I had been putting the IDs out to stderr with the logging utility 
> previously. It was only yesterday when I was squashing the remaining Python 3 
> compat issues that I realized I should just drop them into a print statement. 
> They are, thusly, kinda crazy. In fact, I pass them through a `sort` call, so 
> they won't be in either canonical or document order - unless the document has 
> its verses sorted alphabetically by osisID attribute for some inexplicable 
> reason.
>  
>> It’d also be nice to format it a few per line, indented appropriately.
> 
> Perhaps broken up by book? Or by book/chapter? So it's like:
> Verses missing from:
> Gen
>    1 - 1, 3, 5, 7
>    2 - 11, 22
> Exo
>   27 - 1
> 
> There is a long way to go to improve the output, especially of this detail 
> portion. It was, after all, only intended as debugging output for me while I 
> was writing it.
>  
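A rough sketch of the per-book, per-chapter layout suggested above. It assumes every ID has exactly three dot-separated parts (Book.Chapter.Verse) and keeps books in the order they are first seen, since canonical order would have to come from the v11n itself. The function names are hypothetical.

```python
from collections import defaultdict


def group_ids(osis_ids):
    """Group 'Book.Chapter.Verse' IDs into {book: {chapter: [verses]}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for osis_id in osis_ids:
        # Assumption: exactly three parts; ranges or sub-verse IDs would fail here.
        book, chapter, verse = osis_id.split(".")
        grouped[book][int(chapter)].append(int(verse))
    return grouped


def format_grouped(grouped, heading="Verses missing from:"):
    """Render the grouping as indented text, chapters and verses in numeric order."""
    lines = [heading]
    for book, chapters in grouped.items():
        lines.append(book)
        for chapter in sorted(chapters):
            verses = ", ".join(str(v) for v in sorted(chapters[chapter]))
            lines.append(f"{chapter:>4} - {verses}")
    return "\n".join(lines)


# Made-up sample input mirroring the layout sketched in the message above.
sample = ["Gen.1.1", "Gen.1.3", "Gen.1.5", "Gen.1.7",
          "Gen.2.11", "Gen.2.22", "Exod.27.1"]
report = format_grouped(group_ids(sample))
print(report)
```

Sorting numerically after splitting avoids the alphabetical ordering (Ps.102 before Ps.12) that a plain string sort produces.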
>> 
>> I’d be happy to iterate over any suggestions we agree on.
> 
> As I am not a user of it, nor an intended consumer of it, feel free to 
> improve it as needed! I quickly hacked it together and tossed it out into the 
> world at someone's off-handed request. I don't create modules, though, so I 
> have no vested interest in preserving its current operation in any particular 
> form. And, if this thread has shown anything, it's that likely Peter has been 
> the only user to date. So I doubt you'll disturb anyone else with it.
> 
> If you need my support for anything, I'm happy to lend a hand.
> 
> Pulling in comments from your other email on this thread:
> 
> > I like that it's very simple to read. Having a summary is good. And the 
> > other email which lists the exact ids extra/missing per testament is very 
> > helpful.
> > I think that enumerating the names of the extra/missing books and 
> > extra/missing chapters would be good. No sense in enumerating the ids 
> > within these.
> 
> That probably would be good. I didn't include detection for an entire missing 
> chapter or book, but it shouldn't be too terribly difficult to enhance it 
> with that. A simple brute force check of every detected missing book or 
> chapter to see if there are any matched verses can reveal that pretty easily.
> 
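The brute-force check described above might look something like this. It is only a sketch: it assumes both inputs are collections of three-part osisIDs, and the function name is invented here, not part of either script.

```python
def split_missing(v11n_ids, file_ids):
    """Classify missing IDs as whole books, whole chapters, or single verses."""
    file_ids = set(file_ids)
    missing = set(v11n_ids) - file_ids

    # Books and (book, chapter) pairs that have at least one verse in the file.
    present_books = {i.split(".")[0] for i in file_ids}
    present_chapters = {tuple(i.split(".")[:2]) for i in file_ids}

    books, chapters, verses = set(), set(), set()
    for osis_id in missing:
        book, chapter, _verse = osis_id.split(".")
        if book not in present_books:
            books.add(book)                    # nothing from this book matched
        elif (book, chapter) not in present_chapters:
            chapters.add(f"{book}.{chapter}")  # nothing from this chapter matched
        else:
            verses.add(osis_id)                # only this verse is missing
    return books, chapters, verses


# Made-up sample data for illustration.
v11n = ["Gen.1.1", "Gen.1.2", "Gen.2.1", "Exod.1.1"]
found = ["Gen.1.1"]
books, chapters, verses = split_missing(v11n, found)
print(sorted(books), sorted(chapters), sorted(verses))
```

Swapping the two arguments gives the same breakdown for extra books, chapters, and verses.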
> > I ran mine against an input that was a test case for osis2mod’s infinite 
> > loop and it had 2 extra books and 13 extra chapters. This wouldn’t be 
> > obvious in your results.
> 
> True, mine would just complain about hundreds or even thousands of mismatches 
> and silently swallow the list of what those are. I had a few of those that I 
> omitted from the sample output I captured. For instance, there are large 
> portions of the canon for the Catholic versifications missing from the KJV 
> file. It just lists something absurd like "There are 4,741 missing verses" 
> or whatever it is.
> 
> > Is it an advantage or disadvantage to be compiled against SWORD lib vs 
> > slurping header files?
> 
> Like most things, it's a trade-off. Working with the bindings requires that 
> the Sword bindings are installed on the host system. For someone running on 
> Windows, this is particularly non-trivial. For someone running in macOS it's 
> not too difficult to install from source (I don't believe Homebrew builds 
> them). For users of major Linux distributions, it's downright trivial. On 
> Fedora it has been as simple as a single `dnf install python3-sword` command 
> for a long time now, and it looks like the bindings are also available for 
> Ubuntu starting in 25.04 with an `apt install python3-sword` as well.


Regarding building SWORD on a Mac, I use Homebrew for extra packages. I tried 
to run ./autogen.sh, but it failed on libtoolize, which Homebrew doesn’t have. 
Then I ran cmake, which failed because icu4c required C++17 or better. After 
hacking that into CMakeLists.txt, I got it to work. I’ll see if I can use that 
to run your script.

> Advantages of the binding method are that it doesn't rely on parsing a C 
> header file, nor on the file laying out the values in a certain way. It also 
> can be used offline easily, doesn't require parsing the output of HTML in 
> order to find all the applicable files, and is likely slightly faster. Not 
> that the speed probably matters for a single run of this, but if you're bulk 
> processing files the speed advantages can add up.

I wrote mine so that it can use the include/canon*.h files from a prior local 
SVN clone. This is very fast. I’d be curious to see how it differs in speed 
from yours. The default is to go against the web, which is painfully slow. 
(Note, it doesn’t yet do the standard disclaimer for the web.) Not a big deal 
if it is a single run. Peter mentioned that he does additional analysis of 
the files in problematic areas that cannot be done by the script.

Using the Python bindings does have the advantage of not re-inventing the 
wheel. I was impressed with ChatGPT’s regular expressions to slurp the arrays 
and how concise it was to read the files. There really wasn’t any difficulty in 
parsing the files. Since the canon*.h files are very static and changes are not 
likely to affect the parse, I don’t think this is that big a deal.
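
To give a feel for how little parsing is involved, here is a minimal sketch of the regex approach. It is not the attached script. The embedded sample is a cut-down stand-in for a canon*.h file, and the assumed layout (book entries as {name, OSIS id, abbreviation, chapter count}, followed by an int vm[] array of per-chapter verse counts) should be checked against the real headers before relying on it.

```python
import re

# Cut-down, made-up stand-in for a canon*.h file (layout is an assumption).
SAMPLE = '''
struct sbook otbooks[] = {
  {"Genesis", "Gen", "Gen", 3},
  {"Exodus", "Exod", "Exod", 2},
  {"", "", "", 0}
};
int vm[] = {
  // Genesis
  31, 25, 24,
  // Exodus
  22, 25
};
'''

# Matches {"Name", "OsisId", "Abbrev", chapters}; the empty terminator entry is skipped.
BOOK_RE = re.compile(
    r'\{\s*"([^"]+)"\s*,\s*"([^"]+)"\s*,\s*"[^"]*"\s*,\s*(\d+)\s*\}')


def parse_books(text):
    """Return a list of (name, osis_id, chapter_count) tuples."""
    return [(name, osis, int(n)) for name, osis, n in BOOK_RE.findall(text)]


def parse_verse_maxes(text):
    """Return the flat list of per-chapter verse counts from the vm array."""
    body = re.search(r'int\s+vm\w*\[\]\s*=\s*\{(.*?)\};', text, re.S).group(1)
    body = re.sub(r'//[^\n]*', '', body)  # drop // comments before reading numbers
    return [int(n) for n in re.findall(r'\d+', body)]


def expand_ids(text):
    """Expand the book and verse tables into a list of Book.Chapter.Verse IDs."""
    maxes = iter(parse_verse_maxes(text))
    ids = []
    for _name, osis, chapter_count in parse_books(text):
        for chapter in range(1, chapter_count + 1):
            for verse in range(1, next(maxes) + 1):
                ids.append(f"{osis}.{chapter}.{verse}")
    return ids


ids = expand_ids(SAMPLE)
print(len(ids), ids[0], ids[-1])
```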


> 
> Disadvantages of the binding method are that it's requiring you to revert 
> back to a source build if you are using this to test a canon.h file or if you 
> want to use a canon file that isn't available in the package manager of your 
> Linux distribution. Building from source isn't terribly onerous for most of 
> us contributors but it might be more of a problem for a module maintainer. 
> Then again, how often do we add a new versification to the code base?

So, it’s not something we’d expect a module maker to succeed at if not on 
Un*x. Maybe someone has a library release for macOS or Windows that could 
be used?

> 
> So there are pros and cons between them. I was freshly off of getting the 
> bindings to compile when I wrote the first draft of av11n.py so I naturally 
> went that direction. I also try to avoid writing parsers when I can leverage 
> existing ones, as grammars can be notoriously complex to get correct. So that 
> dictated my choices as much as did anything else, really!

My computer science master’s degree was in compiler writing! It’s definitely not 
for the faint of heart!

> 
> Another possible enhancement might be a CLI flag to limit the testing range 
> to a particular book (or testament) at a time. I have heard people talk about 
> having modules split up to one book per file or similar. If they could say, 
> "Only check this file against Joshua" then it could keep down a significant 
> amount of extra output. But again - I'm not really an intended user of it!

Great idea. That lines up with David’s suggestion of a scope argument.
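
A minimal sketch of what such a flag could look like with argparse. The --book option and the helper name are hypothetical; neither script has this today.

```python
import argparse


def build_parser():
    """Build a parser with a repeatable, hypothetical --book option."""
    parser = argparse.ArgumentParser(
        description="Compare an OSIS file against v11ns (illustrative only).")
    parser.add_argument("osis_file", help="path to the OSIS xml file")
    parser.add_argument(
        "--book", action="append", default=[],
        help="limit the comparison to this OSIS book ID; may be repeated")
    return parser


def in_scope(osis_id, books):
    """True if no books were requested, or the ID belongs to a requested book."""
    return not books or osis_id.split(".")[0] in books


# Simulated command line: check only Joshua.
args = build_parser().parse_args(["josh.osis.xml", "--book", "Josh"])

# Made-up IDs standing in for the IDs expected by a v11n.
v11n_ids = ["Gen.1.1", "Josh.1.1", "Josh.1.2"]
scoped = [i for i in v11n_ids if in_scope(i, args.book)]
print(scoped)
```

Filtering the v11n side before taking differences keeps the other books from being reported as missing.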

And I’m not an intended user of it either. I’m just trying to get people to use 
something other than osis2mod to pick a versification. Looking at the Jira 
issues on osis2mod, in one issue a person listed their script that looped over 
the v11ns and called osis2mod with each. Yuck!

> 
> --Greg
> 
>> 
>> DM
>> 
>>> On Jun 19, 2025, at 12:12 AM, Greg Hellings <greg.helli...@gmail.com 
>>> <mailto:greg.helli...@gmail.com>> wrote:
>>> 
>>> And here's an example now that I've fixed the output of the osisIDs when 
>>> there are fewer than 100 of them:
>>> 
>>> [vagrant@localhost ~]$ ./av11n.py kjv.osis.xml
>>> 
>>> Checking Calvin:
>>> ----------------   
>>>         The following IDs don’t appear in your file:
>>> %s 1Kgs.22.54, 1Sam.20.43, 1Sam.24.23, 3John.1.15, Acts.24.28, Eccl.12.15,
>>> Eccl.12.16, Ezek.21.33, Ezek.21.34, Ezek.21.35, Ezek.21.36, Ezek.21.37,
>>> Hos.12.15, Isa.8.23, Job.39.31, Job.39.32, Job.39.33, Job.39.34, Job.39.35,
>>> Job.39.36, Job.39.37, Job.39.38, Job.40.25, Job.40.26, Job.40.27, Job.40.28,
>>> Jonah.2.11, Mark.10.53, Mark.9.51, Num.13.34, Num.30.17, Ps.102.29,
>>> Ps.108.14, Ps.12.9, Ps.140.14, Ps.142.8, Ps.18.51, Ps.19.15, Ps.20.10,
>>> Ps.21.14, Ps.22.32, Ps.3.9, Ps.30.13, Ps.31.25, Ps.34.23, Ps.36.13,
>>> Ps.38.23, Ps.39.14, Ps.4.9, Ps.40.18, Ps.41.14, Ps.42.12, Ps.44.27,
>>> Ps.45.18, Ps.46.12, Ps.47.10, Ps.48.15, Ps.49.21, Ps.5.13, Ps.51.20,
>>> Ps.51.21, Ps.52.10, Ps.52.11, Ps.53.7, Ps.54.8, Ps.54.9, Ps.55.24,
>>> Ps.56.14, Ps.57.12, Ps.58.12, Ps.59.18, Ps.6.11, Ps.60.13, Ps.60.14,
>>> Ps.61.9, Ps.62.13, Ps.63.12, Ps.64.11, Ps.65.14, Ps.67.8, Ps.68.36,
>>> Ps.69.37, Ps.7.18, Ps.70.6, Ps.75.11, Ps.76.13, Ps.77.21, Ps.8.10,
>>> Ps.80.20, Ps.81.17, Ps.83.19, Ps.84.13, Ps.85.14, Ps.88.19, Ps.89.53,
>>> Ps.9.21, Ps.92.16, Rev.12.18
>>>         There are 93 OT IDs and 5 NT IDs in v11n which aren’t in your file.
>>>         The following IDs don’t appear in v11n:
>>> %s 1Kgs.22.54, 1Sam.20.43, 1Sam.24.23, 3John.1.15, Acts.24.28, Eccl.12.15,
>>> Eccl.12.16, Ezek.21.33, Ezek.21.34, Ezek.21.35, Ezek.21.36, Ezek.21.37,
>>> Hos.12.15, Isa.8.23, Job.39.31, Job.39.32, Job.39.33, Job.39.34, Job.39.35,
>>> Job.39.36, Job.39.37, Job.39.38, Job.40.25, Job.40.26, Job.40.27, Job.40.28,
>>> Jonah.2.11, Mark.10.53, Mark.9.51, Num.13.34, Num.30.17, Ps.102.29,
>>> Ps.108.14, Ps.12.9, Ps.140.14, Ps.142.8, Ps.18.51, Ps.19.15, Ps.20.10,
>>> Ps.21.14, Ps.22.32, Ps.3.9, Ps.30.13, Ps.31.25, Ps.34.23, Ps.36.13,
>>> Ps.38.23, Ps.39.14, Ps.4.9, Ps.40.18, Ps.41.14, Ps.42.12, Ps.44.27,
>>> Ps.45.18, Ps.46.12, Ps.47.10, Ps.48.15, Ps.49.21, Ps.5.13, Ps.51.20,
>>> Ps.51.21, Ps.52.10, Ps.52.11, Ps.53.7, Ps.54.8, Ps.54.9, Ps.55.24,
>>> Ps.56.14, Ps.57.12, Ps.58.12, Ps.59.18, Ps.6.11, Ps.60.13, Ps.60.14,
>>> Ps.61.9, Ps.62.13, Ps.63.12, Ps.64.11, Ps.65.14, Ps.67.8, Ps.68.36,
>>> Ps.69.37, Ps.7.18, Ps.70.6, Ps.75.11, Ps.76.13, Ps.77.21, Ps.8.10,
>>> Ps.80.20, Ps.81.17, Ps.83.19, Ps.84.13, Ps.85.14, Ps.88.19, Ps.89.53,
>>> Ps.9.21, Ps.92.16, Rev.12.18
>>>         There are 1 OT IDs and 29 NT IDs in your file which don’t appear in 
>>> v11n.
>>> 
>>> 
>>> On Wed, Jun 18, 2025 at 11:00 PM Greg Hellings <greg.helli...@gmail.com 
>>> <mailto:greg.helli...@gmail.com>> wrote:
>>>> Here is an example of the first lines of running my script against the 
>>>> kjv.osis.xml file from the git repo:
>>>> 
>>>> 
>>>> Checking Calvin:
>>>> ----------------
>>>>         There are 93 OT IDs and 5 NT IDs in v11n which aren’t in your file.
>>>>         There are 0 OT IDs and 30 NT IDs in your file which don’t appear 
>>>> in v11n.
>>>> 
>>>> Checking Catholic:
>>>> ------------------
>>>>         There are 4530 OT IDs and 3 NT IDs in v11n which aren’t in your 
>>>> file.
>>>>         There are 0 OT IDs and 133 NT IDs in your file which don’t appear 
>>>> in v11n.
>>>> 
>>>> Checking Catholic2:
>>>> -------------------
>>>>         There are 4638 OT IDs and 3 NT IDs in v11n which aren’t in your 
>>>> file.
>>>>         There are 0 OT IDs and 133 NT IDs in your file which don’t appear 
>>>> in v11n.
>>>> 
>>>> Checking DarbyFr:
>>>> -----------------
>>>>         There are 31 OT IDs and 4 NT IDs in v11n which aren’t in your file.
>>>>         There are 0 OT IDs and 30 NT IDs in your file which don’t appear 
>>>> in v11n.
>>>> 
>>>> This continues on to include such output as
>>>> 
>>>> 
>>>> Checking KJV:
>>>> ------------- 
>>>>         Your file has all the references in this v11n
>>>>         Your file has no extra references
>>>> 
>>>> Checking KJVA:         
>>>> --------------
>>>>         There are 5717 OT IDs and 0 NT IDs in v11n which aren’t in your 
>>>> file.
>>>>         Your file has no extra references
>>>> 
>>>> giving a clear example of a winner for this particular file.
>>>> 
>>>> Meanwhile, running it against the kjva.osis.xml file includes this in the 
>>>> results:
>>>> 
>>>> ...
>>>> 
>>>> Checking KJV:        
>>>> -------------        
>>>>         Your file has all the references in this v11n
>>>>         There are 2 OT IDs and 5715 NT IDs in your file which don’t appear 
>>>> in v11n.
>>>> 
>>>> Checking KJVA:
>>>> --------------
>>>>         Your file has all the references in this v11n
>>>>         Your file has no extra references
>>>> ...
>>>> 
>>>> Fiddling with the file has shown me there are a couple of places where I 
>>>> need to tweak it for Python 3 compatibility that I missed the last time I 
>>>> updated. But fixing those couple of little syntax issues resulted in it 
>>>> running just fine in a Fedora 41 VM with nothing more to do than invoke 
>>>> `dnf install python3-sword` to set up the system to use it.
>>>> 
>>>> --Greg
>>>> 
>>>> On Wed, Jun 18, 2025 at 10:40 PM Greg Hellings <greg.helli...@gmail.com 
>>>> <mailto:greg.helli...@gmail.com>> wrote:
>>>>> My script eschews percentages because they seemed relatively pointless to 
>>>>> me for measuring a mismatch like this. Instead it gives a count of both 
>>>>> Old and New Testament osisIDs that it finds missing and another that it 
>>>>> finds unexpectedly for a given versification. If the total of either 
>>>>> count is fewer than 100, the IDs for that particular count are printed to 
>>>>> the console. It will do this for every registered versification in the 
>>>>> version of the library it was compiled against, allowing the user to 
>>>>> select whichever one seems best to them based on the results.
>>>>> 
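In outline, that counting approach can be sketched with two set differences. This is an illustration and not the code from av11n.py; the testament split below uses a hard-coded list of OSIS New Testament book IDs and treats everything else as OT, which is an assumption.

```python
# OSIS book IDs for the New Testament; anything else is counted as OT here.
NT_BOOKS = {
    "Matt", "Mark", "Luke", "John", "Acts", "Rom", "1Cor", "2Cor", "Gal",
    "Eph", "Phil", "Col", "1Thess", "2Thess", "1Tim", "2Tim", "Titus",
    "Phlm", "Heb", "Jas", "1Pet", "2Pet", "1John", "2John", "3John",
    "Jude", "Rev",
}


def count_by_testament(ids):
    """Return (ot_count, nt_count) for a collection of osisIDs."""
    nt = sum(1 for i in ids if i.split(".")[0] in NT_BOOKS)
    return len(ids) - nt, nt


def report(v11n_ids, file_ids, limit=100):
    """Summarize missing and extra IDs, listing them when fewer than `limit`."""
    missing = set(v11n_ids) - set(file_ids)
    extra = set(file_ids) - set(v11n_ids)
    lines = []
    for label, ids in (
        ("in v11n which aren't in your file", missing),
        ("in your file which don't appear in v11n", extra),
    ):
        ot, nt = count_by_testament(ids)
        lines.append(f"There are {ot} OT IDs and {nt} NT IDs {label}.")
        if 0 < len(ids) < limit:
            lines.append(", ".join(sorted(ids)))
    return lines


# Made-up sample data for illustration.
lines = report(["Gen.1.1", "Gen.1.2", "Rev.12.18"], ["Gen.1.1", "Mark.9.51"])
print("\n".join(lines))
```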
>>>>> On Wed, Jun 18, 2025, 10:25 PM David Haslam <dfh...@protonmail.com 
>>>>> <mailto:dfh...@protonmail.com>> wrote:
>>>>>> It’s not just the number of “missing” verses that should figure in the 
>>>>>> percentage score, but also the number of verses that get concatenated to 
>>>>>> the last one in a chapter.
>>>>>> 
>>>>>> The differences in v11n for the Psalms will be especially significant 
>>>>>> for this, in that some v11n renumber many of them. Likewise for the last 
>>>>>> few chapters in the book of Job.
>>>>>> 
>>>>>> Aside: It would be cool to enhance the utility emptyvss by providing a 
>>>>>> command line option that would ignore books that are not included in the 
>>>>>> scope parameter in the conf file.
>>>>>> 
>>>>>> Regards,
>>>>>> 
>>>>>> David
>>>>>> 
>>>>>> On Thu, Jun 19, 2025 at 03:18, DM Smith <dmsm...@crosswire.org 
>>>>>> <mailto:dmsm...@crosswire.org>> wrote:
>>>>>>> 
>>>>>>> David,
>>>>>>> 
>>>>>>> Because it only considers the xml, scope is automatically built into 
>>>>>>> it. It is only comparing what is present in the xml with what is part 
>>>>>>> of the av11ns. 
>>>>>>> 
>>>>>>> It might be good to add the enumeration of missing verses.
>>>>>>> 
>>>>>>> — DM
>>>>>>> 
>>>>>>>> On Jun 18, 2025, at 4:02 PM, David Haslam <dfh...@protonmail.com 
>>>>>>>> <mailto:dfh...@protonmail.com>> wrote:
>>>>>>>> 
>>>>>>>> Does it take account of the Scope key in the .conf file for a less 
>>>>>>>> than complete Bible ?
>>>>>>>> 
>>>>>>>> David
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Wed, Jun 18, 2025 at 20:51, DM Smith <dmsm...@crosswire.org 
>>>>>>>> <mailto:dmsm...@crosswire.org>> wrote:
>>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> Several have commented on how hard it is to test an OSIS xml file 
>>>>>>>>> against v11ns, especially since osis2mod goes off into an infinite loop. 
>>>>>>>>> (I’ve posted a patch that fixes that.) But it is still a process of 
>>>>>>>>> trial and error to find an appropriate v11n.
>>>>>>>>> 
>>>>>>>>> So, I’ve been iterating with chatGPT to create a python script to 
>>>>>>>>> find a best fit v11n. Since I don’t know python, I can’t vouch for 
>>>>>>>>> the script beyond it worked for a simple test case that had an extra 
>>>>>>>>> chapter for Genesis and had some extra verses at the end of a chapter 
>>>>>>>>> in that book.
>>>>>>>>> 
>>>>>>>>> I offer it, as a starting place. See the attached file.
>>>>>>>>> 
>>>>>>>>> It has a --debug flag.
>>>>>>>>> The first argument is expected to be the OSIS xml file.
>>>>>>>>> The second argument is optional and gives the location to the include 
>>>>>>>>> directory of svn/sword/trunk/include with all the canon*.h files. If 
>>>>>>>>> you don’t supply the argument, it uses the web to load the canon*.h 
>>>>>>>>> files from https://www.crosswire.org/svn/sword/trunk/include. 
>>>>>>>>> 
>>>>>>>>> It will score the fitness of each of the v11ns. It gives the score as 
>>>>>>>>> a %, but I don’t know what that means. I told it that it should 
>>>>>>>>> prioritize book matches, then chapter matches and finally verse 
>>>>>>>>> matches. I don’t know how well it did that scoring. I didn’t test for 
>>>>>>>>> that.
>>>>>>>>> 
>>>>>>>>> The output is alphabetized. If more than one v11n has the same high 
>>>>>>>>> score, they are all listed.
>>>>>>>>> 
>>>>>>>>> In His Service,
>>>>>>>>>  DM
>>>>>>>>> 
>>>>>>>> _______________________________________________ 
>>>>>>>> sword-devel mailing list: sword-devel@crosswire.org 
>>>>>>>> <mailto:sword-devel@crosswire.org> 
>>>>>>>> http://crosswire.org/mailman/listinfo/sword-devel 
>>>>>>>> Instructions to unsubscribe/change your settings at above page
>>>>>>> 
