Elisp Tutorial: HTML Syntax Coloring Code Block

Xah Lee, 2007-10

This page shows an example of writing an emacs lisp function that processes a block of text to syntax color it with HTML tags. If you don't know elisp, first take a gander at Emacs Lisp Basics.

HTML version with color and links is at:
http://xahlee.org/emacs/elisp_htmlize.html

---------------------------------------
THE PROBLEM

SUMMARY

I want to write an elisp function such that, when invoked, the block of text the cursor is on will have various HTML style tags wrapped around parts of it. This is for the purpose of publishing programming language code in HTML on the web.

DETAIL

I write a lot of computer programming tutorials for several computer languages. For example: Perl and Python tutorial, Java tutorial, Emacs Lisp tutorial, Javascript tutorial. In these tutorials, there are often code snippets. This code needs to be syntax colored in HTML.

For example, here's an elisp code snippet:

  (if (< 3 2) (message "yes") )

Here's what I actually want as raw HTML:

  (<span class="keyword">if</span> (&lt; 3 2) (message <span class="string">"yes"</span>) )

Which should look like this in a web browser:

  (if (< 3 2) (message "yes") )

There is an emacs package that turns syntax-colored text in emacs into HTML form. This is extremely nice. The package is called htmlize.el and is written (1997,...,2006) by Hrvoje Niksic, available at http://fly.srk.fer.hr/~hniksic/emacs/htmlize.el.

This program provides you with a few new emacs commands. Primarily, it has htmlize-region, htmlize-buffer, htmlize-file. The region and buffer commands output HTML code in a new buffer, and the htmlize-file version takes an input file name and outputs into a file.

When I need to include a code snippet in my tutorial, typically, I write the code in a separate file (e.g. “temp.java”, “temp.py”), run it to make sure the code is correct (compile, if necessary), then copy the file into the HTML tutorial page, inside a «pre» block.
In this scheme, the best way for me to utilize the htmlize.el program is to use the “htmlize-buffer” command on my temp.java, then copy the htmlized output and paste it into my HTML tutorial file inside a «pre» block.

Since many of my tutorials were written haphazardly over the years, before I saw the need for syntax coloration, most of the code exists inside «pre» tags already, without a temp code file. So, in most cases, what I do is select the text inside the «pre» tag, paste it into a temp buffer and invoke the right mode for the language (so the text will be fontified correctly), then do htmlize-buffer, then copy the html output, then paste it back to replace the selected text.

This process is tedious. A tutorial page will have several code blocks. For each, I need to select text, create a buffer, switch mode, do htmlize, select again, switch buffer, then paste. Many of the steps are not pure push-button operations but involve eye-balling. There are a few hundred such pages.

It would be better if I could place the cursor on a code block in an existing HTML page, then press a button, and have emacs magically replace the code block with a htmlized version colorized for the code block's language. We proceed to write this function.

---------------------------------------
SOLUTION

For an elisp expert who knows how fontification works in emacs, the solution would be to write elisp code that maps the fontification info of emacs's strings into html tags. This is exactly what htmlize.el does. Since it is already written, an elisp expert might find the essential code in htmlize.el. (The code is licensed under GPL.)

Unfortunately, my lisp experience isn't so great. I spent maybe 30 minutes trying to look in htmlize.el in hope of finding a function, something like htmlize-str, that is the essence, but wasn't successful. I figured it would actually be faster if I took the dumb and inefficient approach, by writing elisp code that extracts the output from htmlize-buffer.

Here's the outline of the plan of my function:

* 1. Grab the text inside a <pre class="«lang»">...</pre> tag.
* 2. Create a new buffer. Paste the code in.
* 3. Make the new buffer «lang» mode (and fontify it).
* 4. Call htmlize-buffer.
* 5. Grab the (htmlized) text inside the «pre» tag in the output buffer that htmlize created.
* 6. Kill the htmlize buffer and my temp buffer.
* 7. Delete the original text, paste in the new text.

To achieve the above, I decided on 2 steps.

A: Write a function “htmlize-string” that takes a string and mode name, and returns the htmlized string.

B: Write a function “htmlize-block” that does the steps of grabbing text and pasting, and calls “htmlize-string” for the actual htmlization.

Here's the code of my htmlize-string function:

(defun htmlize-string (ccode mn)
  "Take string CCODE and return htmlized code, using mode MN.
This function requires htmlize.el by Hrvoje Niksic, 2006."
  (let (cur-buf temp-buf temp-buf2 x1 x2 resultS)
    (setq cur-buf (buffer-name))
    (setq temp-buf "xout-weewee")
    (setq temp-buf2 "*html*") ; the buffer that htmlize-buffer creates

    ;; put the code in a new buffer, set the mode
    (switch-to-buffer temp-buf)
    (insert ccode)
    (funcall (intern mn))

    ;; htmlize it, then move to the output buffer
    (htmlize-buffer temp-buf)
    (kill-buffer temp-buf)
    (switch-to-buffer temp-buf2)

    ;; extract the core code
    (goto-char (point-min))
    (setq x1 (re-search-forward "<pre>"))
    (setq x1 (+ x1 1)) ; skip one character past <pre> (the newline htmlize emits)
    (re-search-forward "</pre>")
    (setq x2 (re-search-backward "</pre>"))
    (setq resultS (buffer-substring-no-properties x1 x2))

    ;; clean up and return the result
    (kill-buffer temp-buf2)
    (switch-to-buffer cur-buf)
    resultS))

The major part of this code is knowing how to create, switch and kill buffers; then, how to set a mode; lastly, how to grab text in a buffer.

The current buffer's name is given by “buffer-name”. Creating or switching to a buffer is done by “switch-to-buffer”. Killing a buffer is “kill-buffer”. To activate a mode, the code is “(funcall (intern my-mode-name))”. I don't know why this is so in detail, but it is interesting to know.

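A note on why “(funcall (intern my-mode-name))” works: a major mode such as emacs-lisp-mode is an ordinary function named by a symbol, while the mode name here arrives as a string. “intern” returns the symbol whose name is that string, and “funcall” calls the function attached to that symbol. The following is a minimal sketch to try this in isolation. The helper name my-call-mode-by-name is hypothetical, not part of htmlize.el or of the code above.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-call-mode-by-name (mode-name)
  "Activate the major mode whose name is the string MODE-NAME.
Return the value of `major-mode' afterwards."
  (let ((mode-symbol (intern mode-name)))  ; "text-mode" -> the symbol text-mode
    (unless (fboundp mode-symbol)
      (error "No such mode function: %s" mode-name))
    (funcall mode-symbol)                  ; same effect as evaluating (text-mode)
    major-mode))

;; Usage: run it in a throwaway buffer so the current buffer is untouched.
(with-temp-buffer
  (my-call-mode-by-name "emacs-lisp-mode"))
```

The fboundp check is an addition of this sketch. Without it, a misspelled mode name would surface as a less helpful “void-function” error from funcall.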
The grabbing of text is done by locating the desired beginning and ending locations using the re-search functions, and by using buffer-substring-no-properties for actually extracting the string. Here, note the “no-properties” in “buffer-substring-no-properties”. Emacs's strings can contain information called properties, which is essentially the fontification information.

Reference: Elisp Manual: Buffers.
Reference: Elisp Manual: Text-Properties.

Here's the code of my htmlize-block function:

(defun htmlize-block ()
  "Replace the region enclosed by <pre> tag to htmlized code.
For example, if the cursor is somewhere inside the tag:

<pre class=\"code\">
codeXYZ...
</pre>

after calling, the “codeXYZ...” block of text will be htmlized.
That is, wrapped with many <span> tags.

The opening tag must be of the form <pre class=\"lang-str\">.
The “lang-str” determines what emacs mode is used to colorize the code.
This function requires htmlize.el by Hrvoje Niksic."
  (interactive)
  (let (mycode tag-begin styclass code-begin code-end tag-end mymode)
    ;; locate the <pre class="..."> block around the cursor and grab its text
    (setq tag-begin (re-search-backward "<pre class=\"\\([A-Za-z-]+\\)\""))
    (setq styclass (match-string 1))
    (setq code-begin (re-search-forward ">"))
    (re-search-forward "</pre>")
    (setq code-end (re-search-backward "<"))
    (setq tag-end (re-search-forward "</pre>"))
    (setq mycode (buffer-substring-no-properties code-begin code-end))

    ;; map the class name to the name of an emacs mode
    (cond
     ((equal styclass "elisp") (setq mymode "emacs-lisp-mode"))
     ((equal styclass "perl") (setq mymode "cperl-mode"))
     ((equal styclass "python") (setq mymode "python-mode"))
     ((equal styclass "java") (setq mymode "java-mode"))
     ((equal styclass "html") (setq mymode "html-mode"))
     ((equal styclass "haskell") (setq mymode "haskell-mode"))
     (t (error "No emacs mode known for class: %s" styclass)))

    ;; replace the original text with the htmlized text
    (save-excursion
      (delete-region code-begin code-end)
      (goto-char code-begin)
      (insert (htmlize-string mycode mymode)))))

The steps of this function are to grab the text inside a «pre» block, call htmlize-string, then insert the result, replacing the original text.

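The search-and-extract step can be tried in isolation, without htmlize.el. The following is a minimal sketch under some assumptions: the function name my-extract-pre-block is hypothetical, it works on a string placed in a temp buffer rather than on the text around the cursor, and it searches forward for the first block only.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-extract-pre-block (html)
  "Return a list (LANG CODE) for the first <pre class=\"LANG\"> block in HTML.
Return nil when no such block is found."
  (with-temp-buffer
    (insert html)
    (goto-char (point-min))
    ;; Group 1 of the regexp captures the class name, e.g. "elisp".
    (when (re-search-forward "<pre class=\"\\([-A-Za-z]+\\)\">" nil t)
      (let ((lang (match-string 1))
            (code-begin (point)))       ; point is now just after the ">"
        (when (search-forward "</pre>" nil t)
          ;; (match-beginning 0) is the position where "</pre>" starts.
          (list lang
                (buffer-substring-no-properties code-begin
                                                (match-beginning 0))))))))

;; Usage:
(my-extract-pre-block "<p>x</p><pre class=\"elisp\">(+ 1 2)</pre>")
;; => ("elisp" "(+ 1 2)")
```

Passing nil and t as the second and third arguments of the search functions makes them return nil on failure instead of signaling an error, which is why the sketch can use “when” to bail out quietly.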
Originally, I wrote the code to grab text inside plain “<pre>...</pre>” tags, then use some heuristics to determine what language it is, then call htmlize-string with the mode name passed to it. However, since my html pages already have the language information in the form of “<pre class="«lang»">...</pre>” (for CSS reasons), I now search text by that form, and use the “lang” part to determine a mode.

Emacs is beautiful.

Postscript: The story given above is slightly simplified. For example, when I began my language notes and commentaries, they were not planned to be some systematic or sizable tutorial. As the pages grew, more quality was added in the editorial process. So, plain un-colored code inside «pre» started to have “language comment” strings colorized (e.g. “<span class="cmt">#...</span>”), by using simple elisp code that wraps a tag around them, and this function was mapped to a shortcut key for easy execution. As pages and languages grew, I found that colorizing comments wasn't enough, so I started to look for a syntax-coloring html solution. There are solutions in Perl, Python, PHP, but I find the emacs solution best suits my needs, in particular because it integrates with emacs's interactive nature, and my writing work is done in an accumulative, editorial process.

In the beginning I used htmlize-region and htmlize-buffer as they are for new code. Note that this is still a laborious process. Gradually I needed to colorize my old code. The problem is that many code blocks already contain my own «span class="cmt"» tags, and strings common in computer languages such as “<=” have already been transformed into the required html encoding “&lt;=”. So, the elisp code first had to “un-htmlize” these in my htmlize-block code. But once all my existing code had been newly colorized, the part of the code that transforms strings for un-htmlizing was no longer necessary, so it was taken out of htmlize-block, which resumed a cleaner state.

Also, htmlize-block went thru many revisions over the year.

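For reference, the “un-htmlize” step mentioned in the postscript could look roughly like this. It is only a sketch, not the code that was actually used: the function name my-unhtmlize-string is hypothetical, it strips every «span» tag it finds, and it decodes only the three entities &lt;, &gt; and &amp;.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-unhtmlize-string (str)
  "Return STR with <span> tags removed and &lt; &gt; &amp; decoded."
  ;; First strip opening and closing span tags, keeping the text inside them.
  (let ((result (replace-regexp-in-string "</?span[^>]*>" "" str)))
    ;; Decode &amp; last, so that "&amp;lt;" becomes "&lt;" and not "<".
    (dolist (pair '(("&lt;" . "<") ("&gt;" . ">") ("&amp;" . "&")) result)
      (setq result (replace-regexp-in-string
                    (regexp-quote (car pair)) (cdr pair) result t t)))))

;; Usage:
(my-unhtmlize-string "<span class=\"cmt\"># a &lt;= b</span>")
;; => "# a <= b"
```

The span tags are removed before the entities are decoded. Doing it the other way around could turn an encoded “&lt;span&gt;” that was meant as literal text into a real tag and then delete it.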
Some time in the recent past, I had one code wrapper for each language. For example, I had htmlize-me-perl, htmlize-me-python, htmlize-me-java, etc. The need for unification into a single coherent wrapper code didn't materialize.

In general, it is my experience, in particular in writing elisp customization for emacs, that tweaking code periodically thru the year is practical, because it adapts to the constant changes of requirements, environment, and work process. For example, eventually I might write my own htmlize.el, if I happen to need more flexibility, or if my elisp experience grows enough to make the job relatively easy.

Also note: a whole-sale solution is to write a program, in say, Python, that processes html files and replaces the proper sections with htmlized strings. This is perhaps more efficient if all the existing html files are in some uniform format. However, I need to work on my tutorials on a case-by-case basis, in part because some pages contain multiple languages or contain pseudo-code that I do not wish colorized. (For example, some pages contain code of the Mathematica language. Mathematica code is normally written in Mathematica's mathematical-typesetting-capable “front-end” IDE called “Notebook” and is not “syntax-colored” as such.)

  Xah
  ∑ http://xahlee.org/