I've been working on two sites that use full-text search on a wiki-like
system where users write with a WYSIWYG/HTML editor. This obviously doesn't
apply directly to flatpages, but the problem and solution might be of help.

The problem is, if you try to index/search HTML, you get a lot of terrible
results. For example, if you were searching for the word "class",
virtually every document would come up as a result because of the HTML
class="" bit.

The solution we are using was an offshoot of the wiki app from Pinax.

We use Google's diff_match_patch library a lot. Here is the basic
rundown:

1. Get the HTML content from a form submitted by the user.
2. Use Django's striptags default filter to strip the HTML tags.
3. Use diff_match_patch to create a patch between the plain text and
the HTML.
4. Save the plain text as the content on the document model.
5. Save a text version of the patch on the document model.
6. Index the plain text. When you search for strings, this will be what
the search is performed on.
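
To make steps 2, 3 and 5 concrete, here is a rough, standard-library-only
sketch. It is not our actual code: html.parser stands in for Django's
striptags filter, and difflib stands in for diff_match_patch (with the real
library you would reach for patch_make and patch_toText instead). The
function names and the JSON patch format are made up for illustration.

```python
import difflib
import json
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and drops tags (stand-in for Django's striptags)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return only the text content of an HTML string."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def make_patch(plain, html):
    """Return a JSON string of ops that turn `plain` back into `html`.

    Hypothetical format: ["=", n] keeps n characters of plain text,
    ["-", n] skips n characters, ["+", text] inserts text.
    """
    ops = []
    matcher = difflib.SequenceMatcher(None, plain, html, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(["=", i2 - i1])
        else:
            if i2 > i1:
                ops.append(["-", i2 - i1])
            if j2 > j1:
                ops.append(["+", html[j1:j2]])
    return json.dumps(ops)


html = '<p class="intro">Hello <b>world</b></p>'
plain = strip_tags(html)         # "Hello world"
patch = make_patch(plain, html)  # text you can store next to the plain text

print("class" in html)   # True: indexing raw HTML matches the attribute
print("class" in plain)  # False: the plain text does not
```

The plain text and the patch text are the two values that end up on the
document model; only the plain text goes to the search index.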

An instance method that accepts no parameters (so it can be used in
templates) is used to recreate the HTML from the patch and plain text,
and that is what is displayed when a user wants to view the page. As of
yet, we haven't seen any difference in performance when rendering pages.
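
Roughly, that instance method could look like the sketch below. The
Document class here is a plain dataclass standing in for the real model,
and the patch is a made-up JSON list of ops (["=", n] keeps n characters of
plain text, ["-", n] skips n, ["+", text] inserts text) rather than
diff_match_patch's own patch text. With the real library, patch_fromText
and patch_apply would do this work.

```python
import json
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical stand-in for the document model (not a Django model)."""

    content: str  # plain text; this is what gets indexed
    patch: str    # JSON list of ops, stored as text

    def html(self):
        """Rebuild the HTML; takes no arguments so templates can call it."""
        out, pos = [], 0
        for op, arg in json.loads(self.patch):
            if op == "=":    # copy the next `arg` characters of plain text
                out.append(self.content[pos:pos + arg])
                pos += arg
            elif op == "-":  # skip `arg` characters of plain text
                pos += arg
            elif op == "+":  # insert markup
                out.append(arg)
            else:
                raise ValueError("unknown patch op: %r" % (op,))
        return "".join(out)


doc = Document(
    content="Hello world",
    patch=json.dumps(
        [["+", "<p>"], ["=", 6], ["+", "<b>"], ["=", 5], ["+", "</b></p>"]]
    ),
)
print(doc.html())  # <p>Hello <b>world</b></p>
```

Because html() takes no arguments, a template can render it with
{{ doc.html }} (marked safe as appropriate for your setup).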

I'm not certain, but I don't think flatpages would allow you
to do this. And the solution is probably a bit more complicated than
what one would want for just indexing static pages. I may be
wrong, but I don't think you will be able to index full HTML content and
only search the plain text without doing some kind of conversion of the
HTML first.

This does, however, depend on how you set up your templates for
flatpages. I have had flatpages that extend a base template and just
render plain text into a generic <div> container. If you were to do
that, then you could index the flatpage content, as it would only be
plain text. This could be done fairly easily with Django's ORM.

end ramble.

On Jan 14, 5:04 am, Amit Sethi <amit.pureene...@gmail.com> wrote:
> Hi , I have  a project with a few static html pages , I wish to search these
> static html using a django search app . Most of the tutorials I saw are
> focused on searching django models but none I could see concentrates on
> indexing static html pages . Can some one guide be to a library /tutorial
> etc for searching and indexing static pages in a project .
>
> --
> A-M-I-T S|S