Bernardo

 

Being now retired, I do programming just for intellectual stimulation. Your 
problem looked as though it would provide more interest than cryptic crosswords 
or Sudoku, and it touches on areas of Pharo use that I have some experience 
with. So…

 

The attached file, BernardoDemo.st, shows how to use XMLHTMLParser with XPath 
and NeoJSON to tackle your problem – or at least a large subset of it. I 
cobbled it together in a Playground, and the easiest way to use it is to copy 
it into a Playground and ‘do it and go’ for each block of code. There are 
liberal comments, but if anything is not clear, come back to me.

 

A few caveats:

 

1.      XPath is a whole other programming language, embedded in Pharo, which 
takes some learning. I am by no means expert in it, and it may be that I have 
used it clumsily. One advantage of embedding it in Pharo is that you can 
intersperse Pharo and XPath, which I do whenever I can’t solve something 
entirely with XPath. Probably most of the places where I use #collect: followed 
by more XPath could be done entirely in XPath if I knew how.

2.      This is the first time I have tried to use NeoJSON, so do not take my 
code as an example of how to use it. It all works, as far as I can see. I 
cannot claim more than that.

3.      The easiest way to generate an object (or map) in NeoJSON is to start 
with a Pharo dictionary, which I have done everywhere. However, this means you 
have no control over the order in which the attributes appear in the JSON file. 
This is of no importance to a computer, since by definition the attributes are 
unordered, but it makes it a little odd to a human reader of the JSON.

4.      In your spec, the desired output has a lot of unquoted strings for 
attribute names, for example nbd_no. The code produces these strings with 
double quotes, which, as far as I can see, is required for valid JSON.

5.      Note that all numerical values appear in the output as strings. No 
doubt they could be converted to numbers, but I was too lazy to find out how.

6.      I have done this using Moose 5.1 (Pharo 4.0, build #40613), with 
versions of XMLHTMLParser and XPath which I downloaded quite a while ago. There 
are no particularly abstruse uses, so I hope you will be OK if you use more 
recent versions.
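
To make caveats 3 to 5 concrete, here is a minimal sketch in Python rather than Pharo. It is not taken from BernardoDemo.st and says nothing about the NeoJSON API. The record contents are made up for illustration (only the name nbd_no comes from your spec), and to_number is a hypothetical helper. It shows an insertion-ordered map keeping the attribute order, attribute names being written in double quotes, and numeric strings being converted to numbers before output.

```python
import json


def to_number(text):
    """Return text as an int or float if it parses as one, else unchanged."""
    try:
        return int(text)
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        return text


# Hypothetical scraped record: every value arrives as a string (caveat 5).
scraped = {"nbd_no": "1234", "price": "19.5", "title": "Sample job"}

# Python dicts preserve insertion order, so the attributes come out in the
# order they were added (the control that caveat 3 says a plain Pharo
# Dictionary does not give you).
record = {key: to_number(value) for key, value in scraped.items()}

# The serializer always writes attribute names in double quotes (caveat 4).
output = json.dumps(record)
print(output)
```

The same idea should carry over to Pharo by using an order-preserving dictionary and converting the strings before writing, but I have not checked how NeoJSON handles either.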

 

Hope this is helpful.

 

Best wishes

 

Peter Kenny

 

From: Pharo-users [mailto:pharo-users-boun...@lists.pharo.org] On Behalf Of 
Bernardo Ezequiel Contreras
Sent: 27 June 2016 15:17
To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
Subject: Re: [Pharo-users] If you have to do web data scraping, what tool would 
you use?

 

Doru,

 See attached file, it's a job posting from upwork.

 

On Mon, Jun 27, 2016 at 3:58 AM, Tudor Girba <tu...@tudorgirba.com 
<mailto:tu...@tudorgirba.com> > wrote:

Hi,

Could you provide more details about the use case?

Cheers,
Doru



> On Jun 26, 2016, at 11:14 PM, Bernardo Ezequiel Contreras 
> <vonbecm...@gmail.com <mailto:vonbecm...@gmail.com> > wrote:
>
> Hi,
> Imagine that you have to do some data scraping work, what tool would you use?
> I know about ZnClient, Soup, NeoCSV, NeoJSON, is there something else that 
> i'm not aware of it?
>
> thanks.
>
>
> --
> Bernardo E.C.
>
> Sent from a cheap desktop computer in South America.

--
www.tudorgirba.com <http://www.tudorgirba.com> 
www.feenk.com <http://www.feenk.com> 

"If you can't say why something is relevant,
it probably isn't."







 

-- 

Bernardo E.C.

 

Sent from a cheap desktop computer in South America.

Attachment: BernardoDemo.st
Description: Binary data
