Elisp Tutorial: HTML Syntax Coloring Code Block

Xah Lee Wed, 17 Oct 2007 21:23:11 -0700

Xah Lee, 2007-10


This page shows an example of writing an emacs lisp function that
processes a block of text to syntax color it with HTML tags. If you
don't know elisp, first take a gander at Emacs Lisp Basics.

HTML version with color and links is at:
http://xahlee.org/emacs/elisp_htmlize.html

---------------------------------------
THE PROBLEM

SUMMARY

I want to write an elisp function such that, when invoked, the block
of text the cursor is on will have various HTML style tags wrapped
around its parts. This is for the purpose of publishing programming
language code in HTML on the web.

DETAIL

I write a lot of computer programming tutorials for several computer
languages. For example: Perl and Python tutorial, Java tutorial, Emacs
Lisp tutorial, Javascript tutorial. These tutorials often contain
code snippets. These snippets need to be syntax colored in HTML.

For example, here's an elisp code snippet:

(if (< 3 2)  (message "yes") )

Here's what i actually want as raw HTML:

(<span class="keyword">if</span> (&lt; 3 2)  (message <span class="string">"yes"</span>) )

Which should look like this in a web browser:

(if (< 3 2)  (message "yes") )
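The two transformations visible in the raw HTML above (escaping
characters such as “<”, and wrapping a token in a «span» tag) can be
sketched in a few lines of elisp. This is only an illustration:
“my-html-escape” and “my-wrap-span” are hypothetical helper names, not
functions from htmlize.el, and htmlize.el does considerably more.

```elisp
(defun my-html-escape (text)
  "Return TEXT with &, < and > replaced by HTML entities.
Hypothetical helper for illustration; not part of htmlize.el."
  (let ((result text))
    ;; Escape & first, so the entities produced below are not escaped twice.
    (setq result (replace-regexp-in-string "&" "&amp;" result t t))
    (setq result (replace-regexp-in-string "<" "&lt;" result t t))
    (setq result (replace-regexp-in-string ">" "&gt;" result t t))
    result))

(defun my-wrap-span (class text)
  "Return TEXT, HTML-escaped, wrapped in a span tag with CSS class CLASS."
  (format "<span class=\"%s\">%s</span>" class (my-html-escape text)))

;; Example calls:
;; (my-html-escape "(< 3 2)")
;; (my-wrap-span "keyword" "if")
```

The hard part, which this sketch skips entirely, is deciding which
pieces of text are keywords, strings, or comments. That is the part
emacs's fontification already knows how to do.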

There is an emacs package that turns syntax-colored text in emacs into
HTML form. This is extremely nice. The package is called htmlize.el
and is written (1997,...,2006) by Hrvoje Niksic, available at
http://fly.srk.fer.hr/~hniksic/emacs/htmlize.el.

This program provides you with a few new emacs commands. Primarily, it
has htmlize-region, htmlize-buffer, htmlize-file. The region and
buffer commands will output HTML code in a new buffer, and the
htmlize-file version will take an input file name and output into a
file.

When i need to include a code snippet in my tutorial, typically, i
write the code in a separate file (e.g. “temp.java”, “temp.py”), run
it to make sure the code is correct (compile, if necessary), then
copy the file into the HTML tutorial page, inside a «pre» block. In
this scheme, the best way for me to utilize the htmlize.el program is
to use the “htmlize-buffer” command on my temp.java, then copy the
htmlized output and paste that into my HTML tutorial file inside a
«pre» block.

Since many of my tutorials were written haphazardly over the years,
before i saw the need for syntax coloration, most code already exists
inside «pre» tags without a temp code file. So, in most cases, what i
do is select the text inside the «pre» tag, paste it into a temp
buffer, invoke the right mode for the language (so the text will be
fontified correctly), do htmlize-buffer, copy the html output, then
paste it back to replace the selected text.

This process is tedious. A tutorial page will have several code
blocks. For each, i will need to select text, create a buffer, switch
mode, do htmlize, select again, switch buffer, then paste. Many of the
steps are not pure push-button operations but involve eye-balling.
There are a few hundred such pages.

It would be better if i could place the cursor on a code block in an
existing HTML page, then press a button, and have emacs magically
replace the code block with a htmlized version colorized for the code
block's language. We proceed to write this function.

---------------------------------------
SOLUTION

For an elisp expert who knows how fontification works in emacs, the
solution would be to write elisp code that maps the fontification
info of an emacs string into html tags. This is exactly what
htmlize.el does. Since it is already written, an elisp expert might
find the essential code in htmlize.el. (The code is licensed under
GPL.)

Unfortunately, my lisp experience isn't so great. I spent maybe 30
minutes trying to look in htmlize.el in hope of finding a function,
something like htmlize-str, that is the essence, but wasn't
successful. I figured it would actually be faster to take the dumb and
inefficient approach: writing elisp code that extracts the output from
htmlize-buffer. Here's the outline of the plan of my function:

    * 1. Grab the text inside a <pre class="«lang»">...</pre> tag.
    * 2. Create a new buffer. Paste the code in.
    * 3. Make the new buffer «lang» mode (and fontify it).
    * 4. Call htmlize-buffer.
    * 5. Grab the (htmlized) text inside the «pre» tag in the output
         buffer that htmlize creates.
    * 6. Kill the htmlize buffer and my temp buffer.
    * 7. Delete the original text, paste in the new text.

To achieve the above, i decided on 2 steps. A: Write a function
“htmlize-string” that takes a string and mode name, and returns the
htmlized string. B: Write a function “htmlize-block” that does the
steps of grabbing text and pasting, and calls “htmlize-string” for the
actual htmlization.

Here's the code of my htmlize-string function:

(defun htmlize-string (ccode mn)
  "Take string CCODE and return htmlized code, using mode MN.
MN is a mode name given as a string, e.g. \"python-mode\".
This function requires htmlize.el by Hrvoje Niksic, 2006."
  (let (cur-buf temp-buf temp-buf2 x1 x2 resultS)
    (setq cur-buf (buffer-name))
    (setq temp-buf "xout-weewee")
    ;; the buffer that htmlize-buffer creates
    (setq temp-buf2 "*html*")

    ;; put the code in a new buffer, set the mode
    (switch-to-buffer temp-buf)
    (insert ccode)
    (funcall (intern mn))

    (htmlize-buffer temp-buf)
    (kill-buffer temp-buf)
    (switch-to-buffer temp-buf2)

    ;; extract the core code
    (goto-char (point-min))
    (setq x1 (re-search-forward "<pre>"))
    (setq x1 (+ x1 1)) ; skip the character right after <pre>
    (re-search-forward "</pre>")
    (setq x2 (re-search-backward "</pre>"))
    (setq resultS (buffer-substring-no-properties x1 x2))
    (kill-buffer temp-buf2)

    (switch-to-buffer cur-buf)
    resultS))

The major parts of this code are knowing how to create, switch, and
kill buffers, then how to set a mode, and lastly how to grab text in a
buffer.

The name of the current buffer is given by “buffer-name”. Creating or
switching to a buffer is done by “switch-to-buffer”. Killing a buffer
is “kill-buffer”. To activate a mode, the code is “(funcall (intern
my-mode-name))”: “intern” returns the symbol whose name is the given
string, and “funcall” calls the function that symbol names, which for
a mode symbol turns the mode on.
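The “(funcall (intern ...))” idiom can be tried in isolation. The
sketch below is an illustration under assumptions: the function name
“my-major-mode-for-text” is made up, and it uses “with-temp-buffer”
rather than the switch-to-buffer and kill-buffer calls of
htmlize-string, so that it leaves no buffer behind.

```elisp
(defun my-major-mode-for-text (text mode-name)
  "Insert TEXT in a temporary buffer, activate the mode named MODE-NAME.
MODE-NAME is a string such as \"emacs-lisp-mode\".
Return the resulting value of `major-mode' in that buffer."
  (with-temp-buffer
    (insert text)
    ;; `intern' returns the symbol named by the string;
    ;; `funcall' calls the function stored in that symbol.
    (funcall (intern mode-name))
    major-mode))

;; Example call:
;; (my-major-mode-for-text "(message \"yes\")" "emacs-lisp-mode")
```

Calling it with "emacs-lisp-mode" should return the symbol
emacs-lisp-mode, showing that the string really did select the mode.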

Grabbing the text is done by locating the desired beginning and ending
locations using the re-search functions, then using
buffer-substring-no-properties to actually extract the string.
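Here is a small, self-contained sketch of that locate-then-extract
pattern. The name “my-text-between” is hypothetical. It differs
slightly from htmlize-string: instead of searching forward and then
backward for the closing tag, it asks “match-beginning” where the last
match started.

```elisp
(defun my-text-between (open close text)
  "Return the part of TEXT between the literal strings OPEN and CLOSE.
Return nil if either delimiter is not found."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    ;; `re-search-forward' returns the position just after the match,
    ;; or nil (because of the third argument t) when nothing matches.
    (let ((beg (re-search-forward (regexp-quote open) nil t)))
      (when (and beg (re-search-forward (regexp-quote close) nil t))
        ;; (match-beginning 0) is where the CLOSE match starts.
        (buffer-substring-no-properties beg (match-beginning 0))))))

;; Example call:
;; (my-text-between "<pre>" "</pre>" "<body><pre>abc</pre></body>")
```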

Here, note the “no-properties” in “buffer-substring-no-properties”.
An emacs string can carry information called text properties, which
is where the fontification information (the “face” property) lives.
Since we only want the characters, we ask for the string without them.
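The difference can be seen by putting a face on some text by hand and
extracting it both ways. This is a minimal sketch; “my-face-demo” is a
made-up name, and the face is set manually with “propertize” rather
than by a real major mode.

```elisp
(defun my-face-demo ()
  "Return the `face' property seen with and without text properties.
The result is a list: (FACE-FROM-BUFFER-SUBSTRING FACE-FROM-NO-PROPERTIES)."
  (with-temp-buffer
    ;; Insert the word \"if\" carrying a face, as fontification would.
    (insert (propertize "if" 'face 'font-lock-keyword-face))
    (list
     (get-text-property 0 'face
                        (buffer-substring (point-min) (point-max)))
     (get-text-property 0 'face
                        (buffer-substring-no-properties
                         (point-min) (point-max))))))

;; (my-face-demo) returns (font-lock-keyword-face nil)
```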

Reference: Elisp Manual: Buffers.

Reference: Elisp Manual: Text-Properties.

Here's the code of my htmlize-block function:

(defun htmlize-block ()
  "Replace the region enclosed by <pre> tag to htmlized code.
For example, if the cursor is somewhere inside the tag:

<pre class=\"code\">
codeXYZ...
</pre>

after calling, the “codeXYZ...” block of text will be htmlized.
That is, wrapped with many <span> tags.

The opening tag must be of the form <pre class=\"lang-str\">.
The “lang-str” determines what emacs mode is used to colorize
the code.
This function requires htmlize.el by Hrvoje Niksic."
  (interactive)
  (let (mycode tag-begin styclass code-begin code-end tag-end mymode)
    ;; find the enclosing <pre class="..."> tag and the code inside it
    (setq tag-begin (re-search-backward "<pre class=\"\\([A-Za-z-]+\\)\""))
    (setq styclass (match-string 1))
    (setq code-begin (re-search-forward ">"))
    (re-search-forward "</pre>")
    (setq code-end (re-search-backward "<"))
    (setq tag-end (re-search-forward "</pre>"))
    (setq mycode (buffer-substring-no-properties code-begin code-end))
    ;; map the class name to an emacs mode name
    (cond
     ((equal styclass "elisp") (setq mymode "emacs-lisp-mode"))
     ((equal styclass "perl") (setq mymode "cperl-mode"))
     ((equal styclass "python") (setq mymode "python-mode"))
     ((equal styclass "java") (setq mymode "java-mode"))
     ((equal styclass "html") (setq mymode "html-mode"))
     ((equal styclass "haskell") (setq mymode "haskell-mode"))
     ;; stop before deleting anything if the class is not recognized
     (t (error "htmlize-block: no mode for class \"%s\"" styclass)))
    ;; replace the code with its htmlized version
    (save-excursion
      (delete-region code-begin code-end)
      (goto-char code-begin)
      (insert (htmlize-string mycode mymode)))))

The steps of this function are to grab the text inside a «pre» block,
call htmlize-string, then insert the result in place of the original
text.

Originally, i wrote the code to grab text inside plain
“<pre>...</pre>” tags, then use some heuristics to determine what
language it is, then call htmlize-string with the mode name passed to
it. However, my html pages already have the language information in
the form of “<pre class="«lang»">...</pre>” (for CSS reasons), so now
i search for text by that form, and use the “lang” part to determine
a mode.
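The step from the “lang” part of the tag to a mode can also be written
as a table lookup instead of a “cond”. The following is a sketch, not
the code used above: “my-class-mode-alist” and “my-mode-for-pre-tag”
are hypothetical names, and the table simply repeats the class names
handled by htmlize-block.

```elisp
(defvar my-class-mode-alist
  '(("elisp" . "emacs-lisp-mode")
    ("perl" . "cperl-mode")
    ("python" . "python-mode")
    ("java" . "java-mode")
    ("html" . "html-mode")
    ("haskell" . "haskell-mode"))
  "Map from the class attribute of a pre tag to an emacs mode name.")

(defun my-mode-for-pre-tag (tag)
  "Return the mode name for TAG, a string like <pre class=\"perl\">.
Return nil if TAG has no class attribute or the class is not in
`my-class-mode-alist'."
  (when (string-match "<pre class=\"\\([A-Za-z-]+\\)\"" tag)
    ;; (match-string 1 tag) is the text captured by the \\( \\) group.
    (cdr (assoc (match-string 1 tag) my-class-mode-alist))))

;; Example call:
;; (my-mode-for-pre-tag "<pre class=\"perl\">")
```

With a table, supporting another language means adding one entry, and
an unknown class shows up as a nil result that the caller can check.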

Emacs is beautiful.

Postscript:

The story given above is slightly simplified. For example, when i
began my language notes and commentaries, they were not planned to be
some systematic or sizable tutorial. As the pages grew, more polish
was added in the editorial process. So, plain un-colored code inside
«pre» started to have “language comment” strings colorized (e.g.
“<span class="cmt">#...</span>”), by using a simple elisp function
that wraps a tag around them, mapped to a shortcut key for easy
execution. As pages and languages grew, i found that colorizing
comments wasn't enough, so i started to look for a syntax-coloring
html solution. There are solutions in Perl, Python, PHP, but i find
the emacs solution best suits my needs, in particular because it
integrates with emacs's interactive nature, and my writing work is
done in an accumulative, editorial process.

In the beginning i used htmlize-region and htmlize-buffer as they are
for new code. Note that this is still a laborious process. Gradually i
needed to colorize my old code. The problem was that many snippets
already contained my own «span class="cmt"» tags, and strings common
in computer languages such as “<=” had already been transformed into
the required html encoding “&lt;=”. So, the elisp code would first
“un-htmlize” these in my htmlize-block code. But once all my existing
code had been newly colorized, the part of the code that transforms
strings for un-htmlize was no longer necessary, so it was taken out of
htmlize-block, which returned to a cleaner state.

Also, htmlize-block went thru many revisions over the years. At one
point in the recent past, i had one code wrapper for each language.
For example, i had htmlize-me-perl, htmlize-me-python,
htmlize-me-java, etc. The need to unify them into a single coherent
wrapper hadn't yet materialized. In general, it is my experience, in
particular in writing elisp customization for emacs, that tweaking
code periodically thru the years is practical, because it adapts to
the constant changes of requirements, environment, work process. For
example, eventually i might write my own htmlize.el, if i happen to
need more flexibility, or if my elisp experience grows enough to make
the job relatively easy.
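The un-htmlize step mentioned above is not shown in this tutorial. As
a rough idea of what such a step involves, here is a hypothetical
reconstruction: the name “my-unhtmlize” and its exact rules are
assumptions, and it handles only the «span class="cmt"» wrapper and
the three basic entities.

```elisp
(defun my-unhtmlize (text)
  "Return TEXT with comment spans unwrapped and basic entities decoded.
Hypothetical sketch; the original un-htmlize code is not shown."
  (let ((case-fold-search nil)
        (result text))
    ;; Drop the <span class="cmt">...</span> wrapper, keep its content.
    (setq result (replace-regexp-in-string
                  "<span class=\"cmt\">\\([^<]*\\)</span>" "\\1" result t))
    (setq result (replace-regexp-in-string "&lt;" "<" result t t))
    (setq result (replace-regexp-in-string "&gt;" ">" result t t))
    ;; Decode &amp; last, so that &amp;lt; becomes &lt; and not <.
    (setq result (replace-regexp-in-string "&amp;" "&" result t t))
    result))

;; Example call:
;; (my-unhtmlize "<span class=\"cmt\"># a &lt;= b</span>")
```

After this, the text is plain source code again and can be handed to
htmlize-string like any other snippet.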

Also note: a whole-sale solution is to write a program, in say,
Python, that processes html files and replaces the proper sections
with htmlized strings. This is perhaps more efficient if all the
existing html files are in some uniform format. However, i need to
work on my tutorials on a case-by-case basis, in part because some
pages contain multiple languages or contain pseudo-code that i do not
wish colorized. (For example, some pages contain code in the
Mathematica language. Mathematica code is normally written in
Mathematica's mathematical-typesetting-capable “front-end” IDE called
“Notebook” and is not “syntax-colored” as such.)

  Xah
∑ http://xahlee.org/
