If all you're looking for is strings of the form CVE-NNNN-NNNNN (a four-digit year, then four or more digits), then by all means just run a regex against the plain text of the page. If you need to do DOM traversal, then jsoup is a good choice. Otherwise, like Mark said, tree-seq is a great choice if you don't want to play with clojure.walk.
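Here is a minimal sketch of the regex and tree-seq routes, assuming the Hickory shape shown in the quoted message (element maps with a :content vector of strings and child maps). The names cve-pattern, cves-in-text, text-nodes, cves-in-tree and sample are made up for illustration, and the wrapping :div in sample is an assumption, not something from the real page.

```clojure
;; Assumed pattern: "CVE-", a four-digit year, then four or more digits.
(def cve-pattern #"CVE-\d{4}-\d{4,}")

(defn cves-in-text
  "Distinct CVE ids found in a plain string, in order of first appearance."
  [s]
  (distinct (re-seq cve-pattern s)))

(defn text-nodes
  "Walks a Hickory-style tree with tree-seq and returns every string leaf.
  Only :content is followed, so attribute values such as :href are skipped."
  [tree]
  (->> (tree-seq #(or (map? %) (sequential? %))        ; branch?: element maps and content vectors
                 #(if (map? %) (:content %) (seq %))   ; children of a map are its :content
                 tree)
       (filter string?)))

(defn cves-in-tree
  "Distinct CVE ids found in any text node of a Hickory-style tree."
  [tree]
  (->> (text-nodes tree)
       (mapcat #(re-seq cve-pattern %))
       distinct))

;; Hypothetical sample, trimmed from the data in the quoted message.
(def sample
  {:type :element, :attrs nil, :tag :div,
   :content
   [{:type :element, :attrs nil, :tag :p,
     :content ["exploitation of a newly identified vulnerability (CVE-2021-40539) in ManageEngine ADSelfService Plus"]}
    "\n\n"
    {:type :element, :attrs nil, :tag :p,
     :content ["CVE-2021-40539, rated critical by the Common Vulnerability Scoring System (CVSS)"]}]})

(cves-in-tree sample)
;; => ("CVE-2021-40539")
```

Because the regex is applied to each text node separately, an id that happens to be split across two nodes would be missed; running cves-in-text over the page's plain text avoids that.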

On Wed, Feb 2, 2022 at 2:58 PM Mark Nutter <manutte...@gmail.com> wrote:

> I don't know how common it is, but have you looked at the `tree-seq`
> function in Clojure? This seems like a good use case for it.
>
> Mark
>
> On Wed, Feb 2, 2022 at 3:22 PM lawrence...@gmail.com <
> lawrence.krub...@gmail.com> wrote:
>
>> Assume I've been cursed to scrape HTML. If I convert the pages to Hickory
>> I end up with a big mass of data which, sadly, lacks many "class" or "id"s
>> that would let me easily pick out the data I need. However, for the most
>> part, the only thing I really need off this page is the CVEs, which look
>> like this:
>>
>> CVE-2021-40539
>>
>> I'm thinking I might write regex against the plain text of the page, but
>> I'm also curious, is it common to take something like Hiccup or Hickory or
>> a zipper and run regex through it? If yes, how is that done?
>>
>> A small part of the data looks like this:
>>
>> :content
>> [{:type :element,
>>   :attrs
>>   {:class "tip-intro", :style "font-size: 15px;"},
>>   :tag :p,
>>   :content
>>   [{:type :element,
>>     :attrs nil,
>>     :tag :em,
>>     :content
>>     ["This Joint Cybersecurity Advisory uses the MITRE
>> Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK®) framework,
>> Version 8. See the "
>>      {:type :element,
>>       :attrs
>>       {:href
>>        "https://attack.mitre.org/versions/v9/techniques/enterprise/"},
>>       :tag :a,
>>       :content ["ATT&CK for Enterprise"]}
>>      " for referenced threat actor tactics and for
>> techniques."]}]}
>>  "\n\n"
>>  {:type :element,
>>   :attrs nil,
>>   :tag :p,
>>   :content
>>   ["This joint advisory is the result of analytic efforts
>> between the Federal Bureau of Investigation (FBI), United States Coast
>> Guard Cyber Command (CGCYBER), and the Cybersecurity and Infrastructure
>> Security Agency (CISA) to highlight the cyber threat associated with active
>> exploitation of a newly identified vulnerability (CVE-2021-40539) in
>> ManageEngine ADSelfService Plus—a self-service password management and
>> single sign-on solution."]}
>>  "\n\n"
>>  {:type :element,
>>   :attrs nil,
>>   :tag :p,
>>   :content
>>   ["CVE-2021-40539, rated critical by the Common
>> Vulnerability Scoring System (CVSS), is an authentication bypass
>> vulnerability affecting representational state transfer (REST) application
>> programming interface (API) URLs that could enable remote code execution.
>> The FBI, CISA, and CGCYBER assess that advanced persistent threat (APT)
>> cyber actors are likely among those exploiting the vulnerability. The
>> exploitation of ManageEngine ADSelfService Plus poses a serious risk to
>> critical infrastructure companies, U.S.-cleared defense contractors,
>> academic institutions, and other entities that use the software. Successful
>> exploitation of the vulnerability allows an attacker to place webshells,
>> which enable the adversary to conduct post-exploitation activities, such as
>> compromising administrator credentials, conducting lateral movement, and
>> exfiltrating registry hives and Active Directory files."]}
>>  "\n\n"
>>
>> --
>> You received this message because you are subscribed to the Google
>> Groups "Clojure" group.
>> To post to this group, send email to clojure@googlegroups.com
>> Note that posts from new members are moderated - please be patient with
>> your first post.
>> To unsubscribe from this group, send email to
>> clojure+unsubscr...@googlegroups.com
>> For more options, visit this group at
>> http://groups.google.com/group/clojure?hl=en
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "Clojure" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to clojure+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/clojure/5f2bd2a4-5c35-463b-9cb4-eecb9148fc89n%40googlegroups.com
>> <https://groups.google.com/d/msgid/clojure/5f2bd2a4-5c35-463b-9cb4-eecb9148fc89n%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .