mhysnm1...@gmail.com wrote:

> All,
>
> Goal of new project.
>
> I want to scrape all the books I have purchased from Audible.com.
> Eventually I want to export the list as a CSV or maybe JSON file; I have
> not got that far yet. The reasoning behind this is to learn Selenium for
> my work and to get the list of books I have purchased, killing two birds
> with one stone. The work focus is to see whether Selenium can automate
> some of the testing I have to do and collect useful information from web
> pages for my reports. That part of the goal is in the future, as I need
> to build my Python skills up first.
>
> Thus far, I have been successful in logging into Audible and showing the
> library of books. I am able to store the table of books and want to use
> BeautifulSoup to extract the relevant information. The information I
> want from the table is:
>
> * Author
> * Title
> * Date purchased
> * Length
> * Whether the book is in a series (there is a link for this)
> * Link to the page storing the publication details
> * Download link
>
> Hopefully this has given you enough information on what I am trying to
> achieve at this stage. As I learn more about what I am doing, I am
> adding possible extra tasks, such as verifying whether I have already
> downloaded the book via iTunes.
>
> Learning goals:
>
> Using the BeautifulSoup tree that I have extracted from the page source
> for the table, I want to navigate the tree structure. BeautifulSoup
> provides children, siblings and parents attributes. This is where I get
> stuck with the programming logic. BeautifulSoup also provides the
> find_all method and CSS selectors, but I do not want to use those for
> this exercise: I want to learn how to walk a tree starting at the root
> and visiting each node.
I think you make your life harder than necessary if you avoid the tools
provided by the library you are using.

> Then I can look at the attributes for the tag as I go. I believe I have
> to set up a recursive loop or function call. Not sure how to do this.
> Pseudo code:
>
> Build table structure.
> Start at the root node.
> Check to see if there are any children.
> Pass the first child to a function.
> Print the attributes for the tag at this level.
> In the function, check for any sibling nodes.
> If one exists, call the function again.
> If there are no siblings, then start at the first sibling and get its
> child.
>
> This is where I get stuck. Each sibling can have children, and they can
> have siblings. So how do I ensure I visit each node in the tree?

The problem with your description is that siblings do not matter. Just:

- process the root
- iterate over its children and call the function recursively with every
  child as the new root.

To make the function more useful you can pass in a function instead of
hard-coding what you want to do with the elements. Given

    def process_elements(elem, do_stuff):
        do_stuff(elem)
        for child in elem.children:
            process_elements(child, do_stuff)

you can print all elements with

    soup = BeautifulSoup(...)
    process_elements(soup, print)

and

    process_elements(soup, lambda elem: print(elem.name))

will print only the names. You need a bit of error checking to make it
work, though: NavigableString nodes (the text between tags) have no
children attribute, so the recursion must skip them.

But wait -- Python's generators let you rewrite process_elements so that
you can use it without a callback:

    def gen_elements(elem):
        yield elem
        for child in elem.children:
            yield from gen_elements(child)

    for elem in gen_elements(soup):
        print(elem.name)

Note that 'yield from iterable' is a shortcut for 'for x in iterable:
yield x', so there are actually two loops in gen_elements().

> Any tips or tricks for this would be appreciated, as I could use this in
> other situations.
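The walk above, including the error checking it mentions, can be sketched
without a live Audible page or bs4 installed. The Tag class below is a
hypothetical stand-in, not part of BeautifulSoup: it mimics only the two
attributes the walk relies on (.name and .children), and plain strings
play the role of NavigableString leaves, which is why the hasattr() check
is needed.

```python
class Tag:
    """Hypothetical stand-in for a bs4 Tag: just .name and .children."""
    def __init__(self, name, *children):
        self.name = name
        self.children = list(children)

def gen_elements(elem):
    """Yield elem, then every descendant, depth-first."""
    yield elem
    if hasattr(elem, "children"):   # string leaves have no .children
        for child in elem.children:
            yield from gen_elements(child)

# A toy library table: two rows of two cells each.
tree = Tag("table",
           Tag("tr", Tag("td", "Author"), Tag("td", "Title")),
           Tag("tr", Tag("td", "Jane Doe"), Tag("td", "Some Book")))

# Collect only the tag names, skipping the string leaves.
names = [e.name for e in gen_elements(tree) if hasattr(e, "name")]
print(names)   # ['table', 'tr', 'td', 'td', 'tr', 'td', 'td']
```

With real BeautifulSoup the same generator works on the soup object
directly; the hasattr() guard then skips NavigableString nodes.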
_______________________________________________
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor