On 13/12/15 07:44, Crusier wrote:
> Dear All,
>
> I am trying to scrap the following website, however, I have
> encountered some problems. As you can see, I am not really familiar
> with regex and I hope you can give me some pointers to how to solve
> this problem.

I'm not sure why you mention regex because your script doesn't use
regex. And for HTML that's a good thing.

> I hope I can download all the transaction data into the database.
> However, I need to retrieve it first. The data which I hope to
> retrieve it is as follows:
>
> "
> 15:59:59 A 500 6.790 3,395
> 15:59:53 B 500 6.780 3,390................
>

Part of your problem is that the data is not in HTML format but is
in fact part of the JavaScript code on the page. And BeautifulSoup
is not so good at parsing JavaScript.

The page code looks like

<script type="text/javascript" src="../js/jquery.js?verID=20150826_153700"></script>
<script type="text/javascript" src="../js/common_eng.js?verID=20150826_153700"></script>
<script type="text/javascript" src="../js/corsrequest.js?verID=20150826_153700"></script>
<script type="text/javascript" src="../js/wholedaytran.js?verID=20150826_153700"></script>
<script type="text/javascript">
var json_result = {"content":{"0":{"code":"6,881","timestamp":"15:59:59","order":"1175","transaction_type":"","bidask":"...
{"code":"6,881","timestamp":"15:59:53","order":"1174","transaction_type":"",...
{"code":"6,881","timestamp":"15:59:53","order":"1173",...

followed by a bunch of function definitions and other stuff.

> def turnover_detail(url):
>     response = requests.get(url)
>     html = response.content
>     soup = BeautifulSoup(html,"html.parser")
>     data = soup.find_all("script")
>     for json in data:
>         print(json)

The name json here is misleading because it's not really JSON data
at this point but JavaScript code. You will need to further filter
the code lines down to the ones containing data and then convert
them into pure JSON.

You don't show us the output but I'm assuming it's the full
JavaScript program? If so you need a second level of parsing to
extract the data from that. It shouldn't be too difficult since the
data you want all starts with the string

{"code":

apart from the first line which will need a little bit extra work.
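
A rough sketch of that second level of parsing, using only str.find
and the standard json module. It locates the whole object assigned to
json_result rather than picking out each {"code": line by hand. The
function name extract_transactions is made up for illustration, and
the sample text below is invented to resemble the page source quoted
above (the "bidask" values in particular are guesses), so the real
page may need adjustments.

```python
import json

# Marker copied from the page source quoted above.
MARKER = "var json_result ="


def extract_transactions(script_text):
    """Return the transaction dicts embedded in script_text, or []."""
    start = script_text.find(MARKER)
    if start == -1:
        return []
    # The JSON object begins at the first "{" after the marker.
    brace = script_text.find("{", start + len(MARKER))
    if brace == -1:
        return []
    # raw_decode parses one JSON value starting at `brace` and ignores
    # whatever JavaScript follows it.
    data, _end = json.JSONDecoder().raw_decode(script_text, brace)
    # Assumption: the records live under "content", keyed "0", "1", ...
    return list(data.get("content", {}).values())


# Invented sample data, shaped like the snippet quoted above.
sample = '''
var json_result = {"content":{"0":{"code":"6,881","timestamp":"15:59:59","order":"1175","transaction_type":"","bidask":"A"},"1":{"code":"6,881","timestamp":"15:59:53","order":"1174","transaction_type":"","bidask":"B"}}};
function something() { return 1; }
'''

rows = extract_transactions(sample)
for row in rows:
    print(row["timestamp"], row["bidask"], row["order"])
```

In the original script this would be called on the text of each
script tag (for example tag.string, which is None for the tags that
only have a src attribute) instead of printing the tag.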

But I don't think you really need any regex to do this, regular
string methods should suffice. I suggest you should write a helper
function to do the data extraction and experiment in the interpreter
using some cut n pasted sample data till you get it right!

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos

_______________________________________________
Tutor maillist - Tutor@python.org