https://bugzilla.redhat.com/show_bug.cgi?id=2319926
Bug ID: 2319926
Summary: Review-request: python-html-text - Extract text from HTML
Product: Fedora
Version: rawhide
OS: Linux
Status: NEW
Component: Package Review
Severity: medium
Assignee: [email protected]
Reporter: [email protected]
QA Contact: [email protected]
CC: [email protected]
Target Milestone: ---
Classification: Fedora
spec:
https://download.copr.fedorainfracloud.org/results/fed500/gourmand/fedora-rawhide-x86_64/08156160-python-html-text/python-html-text.spec
srpm:
https://download.copr.fedorainfracloud.org/results/fed500/gourmand/fedora-rawhide-x86_64/08156160-python-html-text/python-html-text-0.6.2-1.fc42.src.rpm
description:
How is html_text different from .xpath('//text()') from lxml
or .get_text() from Beautiful Soup?
- Text extracted with html_text does not contain inline styles,
javascript, comments and other text that is not normally visible
to users;
- html_text normalizes whitespace, but in a way smarter than
.xpath('normalize-space()'), adding spaces around inline elements
(which are often used as block elements in html markup), and trying
to avoid adding extra spaces for punctuation;
- html_text can add newlines (e.g. after headers or paragraphs), so
that the output text looks more like how it is rendered in browsers.
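To make the three points above concrete, here is a minimal sketch using only
the Python standard library. It is not html_text's implementation and does
not use its API; the names VisibleTextParser, visible_text, HIDDEN_TAGS and
NEWLINE_TAGS are hypothetical, and the tag sets are assumptions picked for
illustration. It only shows the general idea: skip text that is not visible,
collapse whitespace, and break lines after headers and paragraphs.

```python
import re
from html.parser import HTMLParser

# Hypothetical tag sets for illustration only; html_text applies its own rules.
HIDDEN_TAGS = {"script", "style"}
NEWLINE_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p"}


class VisibleTextParser(HTMLParser):
    """Collect text that a browser would normally display."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.hidden_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self.hidden_depth > 0:
            self.hidden_depth -= 1
        elif tag in NEWLINE_TAGS:
            # Mark a line break after headers and paragraphs.
            self.chunks.append("\n")

    def handle_data(self, data):
        if self.hidden_depth == 0:
            # Collapse any run of whitespace in the source to one space.
            self.chunks.append(re.sub(r"\s+", " ", data))

    # Comments reach handle_comment(), which does nothing by default,
    # so they never end up in the output.


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    lines = "".join(parser.chunks).split("\n")
    cleaned = (re.sub(r" +", " ", line).strip() for line in lines)
    return "\n".join(line for line in cleaned if line)


sample = (
    "<h1>Title</h1><style>p {color: red}</style>"
    "<p>Hello   <b>world</b>!</p>"
    "<script>var x = 1;</script><!-- note -->"
)
print(visible_text(sample))
```

The sketch does not attempt the smarter spacing around inline elements and
punctuation that the description mentions.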
fas: fed500
Comments:
Pytest7 warning seems spurious as pytest7 is not installed.
Reproducible: Always