Hi Eric.

The Journal site (journal.code4lib.org) is a lightly modified WordPress site, 
and the indexing is whatever comes with WordPress. (I would guess it renders 
the HTML to flat text with no regard for authorship and reference sections.)  
The issue is a WordPress category, the date is the WordPress post date (I 
think), and Title is the WordPress title.  Author is a field we added to 
WordPress, and it is just a text field (authors are undistinguished in the 
field).  Abstract is the WordPress summary.  I think the RSS feed from the 
Journal might be a good place to get much of the information, although in some 
cases (like Author), further processing would be required.  We also submit 
metadata to DOAJ (https://doaj.org/toc/1940-5758), the basis of which comes 
from a custom plugin; see, for example, 
http://journal.code4lib.org/issues/issue44/feed/doaj. (The coordinating editor 
downloads that file, manually checks/corrects XML errors, and uploads it to 
DOAJ.)
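
As a rough sketch of the RSS idea, assuming the feed is ordinary WordPress RSS 2.0, something like the following could turn it into a TSV file. The element names are guesses, not checked against the live feed: it assumes the author text appears in dc:creator, the abstract in description, and the issue as a category whose text starts with "Issue". The helper names (split_authors, feed_to_rows, rows_to_tsv) are made up for illustration, and the author splitting is deliberately naive.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Dublin Core namespace, commonly used in WordPress feeds for dc:creator.
NS = {"dc": "http://purl.org/dc/elements/1.1/"}
FIELDS = ["author", "title", "date", "abstract", "link", "issue"]


def split_authors(raw):
    """Split a free-text author field on commas, semicolons, and the word 'and'.

    Assumption: authors are written like "A, B and C"; real data may need more care.
    """
    parts = re.split(r"\s*(?:,|;|\band\b)\s*", raw or "")
    return [p.strip() for p in parts if p.strip()]


def feed_to_rows(rss_xml):
    """Parse an RSS 2.0 document (as a string) into one dict per item."""
    root = ET.fromstring(rss_xml)
    rows = []
    for item in root.iter("item"):
        categories = [c.text or "" for c in item.findall("category")]
        # Assumption: the issue is the category whose label starts with "Issue".
        issues = [c for c in categories if c.lower().startswith("issue")]
        abstract = item.findtext("description", default="")
        rows.append({
            "author": "; ".join(split_authors(
                item.findtext("dc:creator", default="", namespaces=NS))),
            "title": item.findtext("title", default="").strip(),
            "date": item.findtext("pubDate", default="").strip(),
            # Collapse newlines and tabs so each record stays on one TSV line.
            "abstract": " ".join(abstract.split()),
            "link": item.findtext("link", default="").strip(),
            "issue": issues[0] if issues else "",
        })
    return rows


def rows_to_tsv(rows):
    """Serialize the rows as tab-separated text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

To use it, save a copy of the feed locally, read it into a string, and write the result of rows_to_tsv(feed_to_rows(text)) to a file. The output then sorts and groups easily in a spreadsheet or with command-line tools.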

Hope this helps -- sounds like you are doing some interesting work!


Peter
On May 16, 2019, 1:12 PM -0400, Eric Lease Morgan <[email protected]>, wrote:
> How is Code4Lib Journal indexed? What software is used, and more 
> specifically, what characteristics of each article are included in the index?
>
> Our journal is pretty cool, but as a library-related journal, I think it can 
> be better. For example, what are the various indexed fields? Maybe we can 
> support faceted browsing? Search results are returned in a very narrative 
> form -- a format that is not very computable. If search results were in some 
> sort of columnar format (TSV, CSV, etc.) sorting and grouping would be 
> possible as well as analysis.
>
> Recently, I have been playing a lot with natural language processing and this 
> has resulted in the extraction of statistically significant keywords, named 
> entities, parts-of-speech, and even the identification of sentences matching 
> a given grammar. All of these things lend themselves to inputs for machine 
> learning processes. In turn, the results of all these things can be 
> re-incorporated into an index of Code4Lib. Thus the index not only supports 
> find & get but also analysis. For a good time, I'd like to give this a go, 
> just as an experiment.
>
> Is there someplace where I can download a rudimentary metadata file of all 
> Code4Lib articles? At the least, I hope such a metadata file includes fields 
> such as:
>
> * author(s)
> * title
> * date
> * abstract
> * link to full text
> * issue
>
> Is there a place where I can get such metadata?
>
> --
> Eric Morgan
