On 2022-08-22 19:27:28 -, Jon Ribbens via Python-list wrote:
> On 2022-08-22, Peter J. Holzer wrote:
> > On 2022-08-22 00:45:56 -, Jon Ribbens via Python-list wrote:
> >> With the offset though, BeautifulSoup made an arbitrary decision to
> >> use ISO-8859-1 encoding and so when you choppe
On 2022-08-22, Peter J. Holzer wrote:
> On 2022-08-22 00:45:56 -, Jon Ribbens via Python-list wrote:
>> With the offset though, BeautifulSoup made an arbitrary decision to
>> use ISO-8859-1 encoding and so when you chopped the bytestring at
>> that offset it only worked because BeautifulSoup h
On 2022-08-22 00:45:56 -, Jon Ribbens via Python-list wrote:
> With the offset though, BeautifulSoup made an arbitrary decision to
> use ISO-8859-1 encoding and so when you chopped the bytestring at
> that offset it only worked because BeautifulSoup had happened to
> choose a 1-byte-per-charact
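
A minimal sketch of the point about offsets, assuming a hypothetical sample string (not taken from the thread): a character offset found in the decoded text is a valid byte offset only when every character encodes to exactly one byte, as in ISO-8859-1.

```python
# Hypothetical sample text; the non-ASCII "é" comes before the tag.
text = "café <a href='x'>"

# Character offset of the tag within the decoded string.
char_offset = text.index("<a")

latin1_bytes = text.encode("iso-8859-1")  # one byte per character
utf8_bytes = text.encode("utf-8")         # "é" takes two bytes here

# With ISO-8859-1 the character offset is also a valid byte offset.
print(latin1_bytes[char_offset:])

# With UTF-8 the same offset lands one byte too early.
print(utf8_bytes[char_offset:])

# Encoding the prefix yields the byte offset that is correct for UTF-8.
byte_offset = len(text[:char_offset].encode("utf-8"))
print(utf8_bytes[byte_offset:])
```

Slicing the bytestring at a character offset is therefore only safe if the encoding is known to be single-byte, or if the offset is converted first as in the last step.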
On 2022-08-22 00:09:01 -, Jon Ribbens via Python-list wrote:
> On 2022-08-21, Peter J. Holzer wrote:
> > On 2022-08-20 21:51:41 -, Jon Ribbens via Python-list wrote:
> >> result = re.sub(
> >> r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*OLD\s*\2""",
> >
> > This will fail on:
> >
>
On 2022-08-21, Chris Angelico wrote:
> On Mon, 22 Aug 2022 at 05:43, Jon Ribbens via Python-list
> wrote:
>> On 2022-08-21, Chris Angelico wrote:
>> > On Sun, 21 Aug 2022 at 09:31, Jon Ribbens via Python-list
>> > wrote:
>> >> On 2022-08-20, Chris Angelico wrote:
>> >> > On Sun, 21 Aug 2022 at 0
On 2022-08-21, Peter J. Holzer wrote:
> On 2022-08-20 21:51:41 -, Jon Ribbens via Python-list wrote:
>> On 2022-08-20, Stefan Ram wrote:
>> > Jon Ribbens writes:
>> >>... or you could avoid all that faff and just do re.sub()?
>
>> > source = ''
>> >
>> > # Use Python to change the source, ke
On 22/08/2022 05:30, Chris Angelico wrote:
On Mon, 22 Aug 2022 at 10:04, Buck Evan wrote:
I've had much success doing round trips through the lxml.html parser.
https://lxml.de/lxmlhtml.html
I ditched bs for lxml long ago and never regretted it.
If you find that you have a bunch of invalid h
On Mon, 22 Aug 2022 at 10:04, Buck Evan wrote:
>
> I've had much success doing round trips through the lxml.html parser.
>
> https://lxml.de/lxmlhtml.html
>
> I ditched bs for lxml long ago and never regretted it.
>
> If you find that you have a bunch of invalid html that lxml inadvertently
> "fi
I've had much success doing round trips through the lxml.html parser.
https://lxml.de/lxmlhtml.html
I ditched bs for lxml long ago and never regretted it.
If you find that you have a bunch of invalid html that lxml inadvertently
"fixes", I would recommend adding a stutter-step to your project: p
On Mon, 22 Aug 2022 at 05:43, Jon Ribbens via Python-list
wrote:
>
> On 2022-08-21, Chris Angelico wrote:
> > On Sun, 21 Aug 2022 at 09:31, Jon Ribbens via Python-list
> > wrote:
> >> On 2022-08-20, Chris Angelico wrote:
> >> > On Sun, 21 Aug 2022 at 03:27, Stefan Ram wrote:
> >> >> 2qdxy4rzwzu
On 2022-08-21, Chris Angelico wrote:
> On Sun, 21 Aug 2022 at 09:31, Jon Ribbens via Python-list
> wrote:
>> On 2022-08-20, Chris Angelico wrote:
>> > On Sun, 21 Aug 2022 at 03:27, Stefan Ram wrote:
>> >> 2qdxy4rzwzuui...@potatochowder.com writes:
>> >> >textual representations. That way, the f
On 2022-08-20 21:51:41 -, Jon Ribbens via Python-list wrote:
> On 2022-08-20, Stefan Ram wrote:
> > Jon Ribbens writes:
> >>... or you could avoid all that faff and just do re.sub()?
> > source = ''
> >
> > # Use Python to change the source, keeping the order of attributes.
> >
> > result =
> On 21 Aug 2022, at 09:12, Chris Angelico wrote:
>
> On Sun, 21 Aug 2022 at 17:26, Barry wrote:
>>
>>
>>
On 19 Aug 2022, at 22:04, Chris Angelico wrote:
>>>
>>> On Sat, 20 Aug 2022 at 05:12, Barry wrote:
>> On 19 Aug 2022, at 19:33, Chris Angelico wrote:
>>
On Sun, 21 Aug 2022 at 17:26, Barry wrote:
>
>
>
> > On 19 Aug 2022, at 22:04, Chris Angelico wrote:
> >
> > On Sat, 20 Aug 2022 at 05:12, Barry wrote:
> >>
> >>
> >>
> On 19 Aug 2022, at 19:33, Chris Angelico wrote:
> >>>
> >>> What's the best way to precisely reconstruct an HTML file a
> On 19 Aug 2022, at 22:04, Chris Angelico wrote:
>
> On Sat, 20 Aug 2022 at 05:12, Barry wrote:
>>
>>
>>
On 19 Aug 2022, at 19:33, Chris Angelico wrote:
>>>
>>> What's the best way to precisely reconstruct an HTML file after
>>> parsing it with BeautifulSoup?
>>
>> I recall that
On Sun, 21 Aug 2022 at 13:41, dn wrote:
>
> On 21/08/2022 13.00, Chris Angelico wrote:
> > Well, I don't like headaches, but I do appreciate what the G&S Archive
> > has given me over the years, so I'm taking this on as a means of
> > giving back to the community.
>
> This point will be picked-up
On 21/08/2022 13.00, Chris Angelico wrote:
> On Sun, 21 Aug 2022 at 09:48, dn wrote:
>> On 20/08/2022 12.38, Chris Angelico wrote:
>>> On Sat, 20 Aug 2022 at 10:19, dn wrote:
On 20/08/2022 09.01, Chris Angelico wrote:
> On Sat, 20 Aug 2022 at 05:12, Barry wrote:
>>> On 19 Aug 2022,
On Sun, 21 Aug 2022 at 09:48, dn wrote:
>
> On 20/08/2022 12.38, Chris Angelico wrote:
> > On Sat, 20 Aug 2022 at 10:19, dn wrote:
> >> On 20/08/2022 09.01, Chris Angelico wrote:
> >>> On Sat, 20 Aug 2022 at 05:12, Barry wrote:
> > On 19 Aug 2022, at 19:33, Chris Angelico wrote:
> >
> >
On Sun, 21 Aug 2022 at 09:31, Jon Ribbens via Python-list
wrote:
>
> On 2022-08-20, Chris Angelico wrote:
> > On Sun, 21 Aug 2022 at 03:27, Stefan Ram wrote:
> >> 2qdxy4rzwzuui...@potatochowder.com writes:
> >> >textual representations. That way, the following two elements are the
> >> >same (a
On 20/08/2022 12.38, Chris Angelico wrote:
> On Sat, 20 Aug 2022 at 10:19, dn wrote:
>> On 20/08/2022 09.01, Chris Angelico wrote:
>>> On Sat, 20 Aug 2022 at 05:12, Barry wrote:
> On 19 Aug 2022, at 19:33, Chris Angelico wrote:
>
> What's the best way to precisely reconstruct an HTM
On 2022-08-20, Stefan Ram wrote:
> Jon Ribbens writes:
>>... or you could avoid all that faff and just do re.sub()?
>
> import bs4
> import re
>
> source = ''
>
> # Use Python to change the source, keeping the order of attributes.
>
> result = re.sub( r'href\s*=\s*"http"', r'href="https"', source
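
A small runnable sketch of the substitution quoted above. The `source` value in the archived message was lost, so the sample tag here is a hypothetical stand-in. It only illustrates that a textual substitution leaves attribute order and the surrounding whitespace alone.

```python
import re

# Hypothetical sample; the original `source` string did not survive archiving.
source = '<a  class="x"   href = "http"  id="y">link</a>'

# Same pattern and replacement as in the quoted message. Only the href
# attribute is rewritten; everything else in the tag is left untouched.
result = re.sub(r'href\s*=\s*"http"', r'href="https"', source)

print(result)
```

Note that the replacement normalises the whitespace around `=` inside the matched attribute, so even this approach is not a byte-for-byte round trip for the part it touches.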
On 2022-08-20, Chris Angelico wrote:
> On Sun, 21 Aug 2022 at 03:27, Stefan Ram wrote:
>> 2qdxy4rzwzuui...@potatochowder.com writes:
>> >textual representations. That way, the following two elements are the
>> >same (and similar with a collection of sub-elements in a different order
>> >in anoth
On Sun, 21 Aug 2022 at 03:27, Stefan Ram wrote:
>
> 2qdxy4rzwzuui...@potatochowder.com writes:
> >textual representations. That way, the following two elements are the
> >same (and similar with a collection of sub-elements in a different order
> >in another document):
>
> The /elements/ differ.
On 2022-08-19, Chris Angelico wrote:
> What's the best way to precisely reconstruct an HTML file after
> parsing it with BeautifulSoup?
>
> Using the Alice example from the BS4 docs:
>
html_doc = """The Dormouse's story
>
>The Dormouse's story
>
>Once upon a time there were three little siste
On Sat, 20 Aug 2022 at 10:19, dn wrote:
>
> On 20/08/2022 09.01, Chris Angelico wrote:
> > On Sat, 20 Aug 2022 at 05:12, Barry wrote:
> >>
> >>
> >>
> >>> On 19 Aug 2022, at 19:33, Chris Angelico wrote:
> >>>
> >>> What's the best way to precisely reconstruct an HTML file after
> >>> parsing it
On 20/08/2022 09.01, Chris Angelico wrote:
> On Sat, 20 Aug 2022 at 05:12, Barry wrote:
>>
>>
>>
>>> On 19 Aug 2022, at 19:33, Chris Angelico wrote:
>>>
>>> What's the best way to precisely reconstruct an HTML file after
>>> parsing it with BeautifulSoup?
>>
>> I recall that in bs4 it parses int
On Sat, 20 Aug 2022 at 10:04, David wrote:
>
> On Sat, 20 Aug 2022 at 04:31, Chris Angelico wrote:
>
> > What's the best way to precisely reconstruct an HTML file after
> > parsing it with BeautifulSoup?
>
> > Note two distinct changes: firstly, whitespace has been removed, and
> > secondly, attr
On Sat, 20 Aug 2022 at 04:31, Chris Angelico wrote:
> What's the best way to precisely reconstruct an HTML file after
> parsing it with BeautifulSoup?
> Note two distinct changes: firstly, whitespace has been removed, and
> secondly, attributes are reordered (I think alphabetically). There are
>
On Sat, 20 Aug 2022 at 05:12, Barry wrote:
>
>
>
> > On 19 Aug 2022, at 19:33, Chris Angelico wrote:
> >
> > What's the best way to precisely reconstruct an HTML file after
> > parsing it with BeautifulSoup?
>
> I recall that in bs4 it parses into an object tree and loses the detail of
> the in
On 2022-08-19 at 20:12:35 +0100,
Barry wrote:
> > On 19 Aug 2022, at 19:33, Chris Angelico wrote:
> >
> > What's the best way to precisely reconstruct an HTML file after
> > parsing it with BeautifulSoup?
>
> I recall that in bs4 it parses into an object tree and loses the
> detail of the inp
> On 19 Aug 2022, at 19:33, Chris Angelico wrote:
>
> What's the best way to precisely reconstruct an HTML file after
> parsing it with BeautifulSoup?
I recall that in bs4 it parses into an object tree and loses the detail of the
input.
I recently ported from very old bs to bs4 and hit the s
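
To illustrate what "loses the detail of the input" means, here is a sketch using the standard library's `html.parser` rather than bs4 (a deliberate swap so the example has no third-party dependency). The class name `TagRecorder` and the sample tag are hypothetical. The parsed view drops quoting style, attribute-name case and inter-attribute whitespace, while `get_starttag_text()` still returns the original text of the tag.

```python
from html.parser import HTMLParser


class TagRecorder(HTMLParser):
    """Record both the parsed view and the raw text of each start tag."""

    def __init__(self):
        super().__init__()
        self.parsed = []  # (tag, attrs) pairs as the parser reports them
        self.raw = []     # the start tag exactly as written in the input

    def handle_starttag(self, tag, attrs):
        self.parsed.append((tag, attrs))
        self.raw.append(self.get_starttag_text())


# Hypothetical input with odd spacing, mixed quotes and an upper-case name.
source = "<a  id='link1'   HREF=\"http://example.com/elsie\">Elsie</a>"

recorder = TagRecorder()
recorder.feed(source)
recorder.close()

print(recorder.parsed)
print(recorder.raw)
```

Any tool that rebuilds the document from the parsed view alone has to invent the spacing and quoting again, which is why an exact reconstruction generally needs the raw text kept alongside the tree.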