On 18/03/2021 22:08, H wrote:
> On 03/18/2021 04:30 PM, Paul Heinlein wrote:
>> On Thu, 18 Mar 2021, H wrote:
>>
>>> I have a challenge I am interested in getting feedback on.
>>>
>>> I will on a regular basis download a series of data files from the web 
>>> where the data is in XML-format. The format is known in advance but is 
>>> different between the various data files. I then plan to extract the 
>>> various data items ("elements?") from each data file, do some light 
>>> formatting and then save desired parts of each original data file as a 
>>> formatted CSV-file for later importing into a database.
>>>
>>> As the plan is to use a bash shell script using curl to get the files, I 
>>> have begun looking at external XML parsers that I can call from my script, 
>>> perhaps specify which elements I want, get the data back in some kind of 
>>> bash data structure and finally format and save as CSV-files.
>>>
>>> There seem to be a number of XML parsers available but perhaps someone on 
>>> the list has a recommendation for which one might suit my needs best? I 
>>> should add that I am running CentOS 7.
>>
>> Will you be using an XSLT stylesheet to do the work? There's a somewhat 
>> steep learning curve, but in my experience it's the most reliable method for 
>> parsing XML except in the very simplest of cases.
>>
>> In that case, the libxslt stuff may be what you want:
>>
>>   http://xmlsoft.org/libxslt/
>>
>> The command-line tool is xsltproc.
>>
>> Again, it's not easy to use, but once you've built a toolchain, it will be 
>> reliable and fairly easy to modify if the source XML schema changes.
>>
> I just checked and I cannot see that the organization publishing these data 
> files offers any XSLT stylesheet. IOW, I am, perhaps incorrectly, assuming 
> that the publisher of the data would be one with said stylesheet. (Although 
> perhaps that is something an end-user could put together as well??)
> 
> Although the data format of each data series is unique, it is simple and 
> could conceivably be parsed using grep but I am looking for a more 
> "forward-looking" solution for other applications in the future.
> 
> If XSLT stylesheets are not available - would you suggest another tool? Or, 
> would you suggest I design sheets, presumably one for each data series?
> 

I have used xmlstarlet (available in EPEL) in the past for quick parsing
from within bash scripts.
For something more robust, maybe switch to Python? (YMMV)
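
As a rough illustration of the Python route, here is a minimal sketch using
only the standard library (xml.etree.ElementTree and csv). The element names
(observations, observation, date, value) and the helper xml_to_csv are made
up for the example, since the real feed layout isn't shown in this thread;
you would substitute the record element and fields of each data series.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document; the real feeds' element names will differ.
SAMPLE_XML = """<observations>
  <observation>
    <date>2021-03-18</date>
    <value>1.25</value>
  </observation>
  <observation>
    <date>2021-03-19</date>
    <value>1.30</value>
  </observation>
</observations>"""


def xml_to_csv(xml_text, record_tag, fields):
    """Return CSV text with a header row and one row per <record_tag> element."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(fields)
    for record in root.iter(record_tag):
        # findtext() falls back to the default when a child element is missing,
        # so an absent field becomes an empty CSV cell.
        row = [(record.findtext(name, default="") or "").strip() for name in fields]
        writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    # In practice xml_text would be the content of a file fetched with curl.
    print(xml_to_csv(SAMPLE_XML, "observation", ["date", "value"]), end="")
```

One such call per data series (each with its own record tag and field list)
would keep the per-format differences in a small table instead of in separate
grep patterns.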

-- 
Fabian Arrotin
The CentOS Project | https://www.centos.org
gpg key: 17F3B7A1 | twitter: @arrfab
_______________________________________________
CentOS mailing list
CentOS@centos.org
https://lists.centos.org/mailman/listinfo/centos
