Re: [go-nuts] The side effect of calling html.Token()

Nigel Tao Mon, 15 Jan 2018 14:59:45 -0800

On Sun, Jan 14, 2018 at 4:33 PM, Tong Sun <suntong...@gmail.com> wrote:
> Not being able to do that, I have to save all the Token() info to different
> variables, then pass all those variables to my function separately, instead
> of passing merely a single tokenizer.

Instead of using different variables, I'd just pass the Token itself
around. It should already contain everything you need. For example, if
t is a variable of type html.Token, and t.Type is html.StartTagToken,
html.EndTagToken or html.SelfClosingTagToken, then t.Data is the tag
name, such as "script". It's a string-typed field, not a method that
returns a string, so there are no restrictions like those on calling
Tokenizer methods multiple times.

As an optional, advanced-level comment, t.DataAtom will also be a
hashed uint32 value of that string, for well-known strings. For
example, the uint32 constant atom.Script (from the
golang.org/x/net/html/atom package) corresponds to a "script" tag.
Comparing uint32 values is noticeably faster than comparing string
values, if you're doing a *lot* of tag name comparisons. I can't
remember the exact number, but IIRC, the x/net/html parser
(which builds a DOM tree from the token stream) got a 10% or 30% speed
boost by comparing atoms instead of strings.

On Sun, Jan 14, 2018 at 4:53 PM, Tong Sun <suntong...@gmail.com> wrote:
>  Actually, found out I only called Token() once:
>
> https://play.golang.org/p/HtevQ3RbQsi
>
>  reader := strings.NewReader("<div class=\"hello\">SomeText</div>")
>  tokenizer := html.NewTokenizer(reader)
>  tokenizer.Next()
>  fmt.Println(tokenizer.TagName())
>  fmt.Println(tokenizer.Token())

Again, from the "EBNF" in the package documentation:

----
In EBNF notation, the valid call sequence per token is:

Next {Raw} [ Token | Text | TagName {TagAttr} ]
----

If you're not familiar with EBNF
(https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form),
this means that you can call the Token method (once), or call the
TagName method (once), but not call both.

> I.e., what I used in the loop the standard way was `TagName()`:
>
>  case html.StartTagToken, html.EndTagToken:
>  tn, _ := z.TagName()
>  tag := strings.ToLower(string(tn))

The strings.ToLower call should be unnecessary. As
https://godoc.org/golang.org/x/net/html#Tokenizer.TagName says,
"TagName returns the *lower-cased* name of a tag token", (emphasis
added).

> But by the time I need to call Token() within my function (printElmt) to get
> the full token info, it's already impossible.

As I said earlier in this message, just call Token (the method) once,
at the top of the loop, and switch on its Type field, pass the Token
(the type) around, etc.

