Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

Artur Zakirov Wed, 17 Feb 2016 05:23:39 -0800

On 16.02.2016 18:14, Artur Zakirov wrote:

I attached a new version of the patch.

Sorry for noise. I attached new version of the patch. I saw mistakes inDecodeFlag(). This patch fix them.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

*** a/doc/src/sgml/textsearch.sgml
--- b/doc/src/sgml/textsearch.sgml
***************
*** 2615,2632 **** SELECT plainto_tsquery('supernova star');
     </para>
  
     <para>
!     To create an <application>Ispell</> dictionary, use the built-in
!     <literal>ispell</literal> template and specify several parameters:
     </para>
! 
  <programlisting>
! CREATE TEXT SEARCH DICTIONARY english_ispell (
      TEMPLATE = ispell,
!     DictFile = english,
!     AffFile = english,
!     StopWords = english
! );
  </programlisting>
  
     <para>
      Here, <literal>DictFile</>, <literal>AffFile</>, and <literal>StopWords</>
--- 2615,2655 ----
     </para>
  
     <para>
!     To create an <application>Ispell</> dictionary perform these steps:
     </para>
!    <itemizedlist spacing="compact" mark="bullet">
!     <listitem>
!      <para>
!       download dictionary configuration files. <productname>OpenOffice</>
!       extension files have the <filename>.oxt</> extension. It is necessary
!       to extract <filename>.aff</> and <filename>.dic</> files, change
!       extensions to <filename>.affix</> and <filename>.dict</>. For some
!       dictionary files it is also needed to convert characters to the UTF-8
!       encoding with commands (for example, for norwegian language dictionary):
  <programlisting>
! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff
! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic
! </programlisting>
!      </para>
!     </listitem>
!     <listitem>
!      <para>
!       copy files to the <filename>$SHAREDIR/tsearch_data</> directory
!      </para>
!     </listitem>
!     <listitem>
!      <para>
!       load files into PostgreSQL with the following command:
! <programlisting>
! CREATE TEXT SEARCH DICTIONARY english_hunspell (
      TEMPLATE = ispell,
!     DictFile = en_us,
!     AffFile = en_us,
!     Stopwords = english);
  </programlisting>
+      </para>
+     </listitem>
+    </itemizedlist>
  
     <para>
      Here, <literal>DictFile</>, <literal>AffFile</>, and <literal>StopWords</>
***************
*** 2643,2648 **** CREATE TEXT SEARCH DICTIONARY english_ispell (
--- 2666,2721 ----
     </para>
  
     <para>
+     The <filename>.affix</> file of <application>Ispell</> has the following
+     structure:
+ <programlisting>
+ prefixes
+ flag *A:
+     .           >   RE      # As in enter > reenter
+ suffixes
+ flag T:
+     E           >   ST      # As in late > latest
+     [^AEIOU]Y   >   -Y,IEST # As in dirty > dirtiest
+     [AEIOU]Y    >   EST     # As in gray > grayest
+     [^EY]       >   EST     # As in small > smallest
+ </programlisting>
+    </para>
+    <para>
+     And the <filename>.dict</> file has the following structure:
+ <programlisting>
+ lapse/ADGRS
+ lard/DGRS
+ large/PRTY
+ lark/MRS
+ </programlisting>
+    </para>
+ 
+    <para>
+     Format of the <filename>.dict</> file is:
+ <programlisting>
+ basic_form/affix_class_name
+ </programlisting>
+    </para>
+ 
+    <para>
+     In the <filename>.affix</> file every affix flag is described in the
+     following format:
+ <programlisting>
+ condition > [-stripping_letters,] adding_affix
+ </programlisting>
+    </para>
+ 
+    <para>
+     Here, condition has a format similar to the format of regular expressions.
+     It can use groupings <literal>[...]</> and <literal>[^...]</>.
+     For example, <literal>[AEIOU]Y</> means that the last letter of the word
+     is <literal>"y"</> and the penultimate letter is <literal>"a"</>,
+     <literal>"e"</>, <literal>"i"</>, <literal>"o"</> or <literal>"u"</>.
+     <literal>[^EY]</> means that the last letter is neither <literal>"e"</>
+     nor <literal>"y"</>.
+    </para>
+ 
+    <para>
      Ispell dictionaries support splitting compound words;
      a useful feature.
      Notice that the affix file should specify a special flag using the
***************
*** 2663,2668 **** SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
--- 2736,2800 ----
  </programlisting>
     </para>
  
+    <para>
+     <application>MySpell</> is very similar to <application>Hunspell</>.
+     The <filename>.affix</> file of <application>Hunspell</> has the following
+     structure:
+ <programlisting>
+ PFX A Y 1
+ PFX A   0     re         .
+ SFX T N 4
+ SFX T   0     st         e
+ SFX T   y     iest       [^aeiou]y
+ SFX T   0     est        [aeiou]y
+ SFX T   0     est        [^ey]
+ </programlisting>
+    </para>
+ 
+    <para>
+     The first line of an affix class is the header. Fields of an affix rules are
+     listed after the header:
+    </para>
+    <itemizedlist spacing="compact" mark="bullet">
+     <listitem>
+      <para>
+       parameter name (PFX or SFX)
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       flag (name of the affix class)
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       stripping characters from beginning (at prefix) or end (at suffix) of the
+       word
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       adding affix
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       condition that has a format similar to the format of regular expressions.
+      </para>
+     </listitem>
+    </itemizedlist>
+ 
+    <para>
+     The <filename>.dict</> file looks like the <filename>.dict</> file of
+     <application>Ispell</>:
+ <programlisting>
+ larder/M
+ lardy/RT
+ large/RSPMYT
+ largehearted
+ </programlisting>
+    </para>
+ 
     <note>
      <para>
       <application>MySpell</> does not support compound words.
*** a/src/backend/tsearch/Makefile
--- b/src/backend/tsearch/Makefile
***************
*** 13,20 **** include $(top_builddir)/src/Makefile.global
  
  DICTDIR=tsearch_data
  
! DICTFILES=synonym_sample.syn thesaurus_sample.ths hunspell_sample.affix \
! 	ispell_sample.affix ispell_sample.dict
  
  OBJS = ts_locale.o ts_parse.o wparser.o wparser_def.o dict.o \
  	dict_simple.o dict_synonym.o dict_thesaurus.o \
--- 13,23 ----
  
  DICTDIR=tsearch_data
  
! DICTFILES=dicts/synonym_sample.syn dicts/thesaurus_sample.ths \
! 	dicts/hunspell_sample.affix \
! 	dicts/ispell_sample.affix dicts/ispell_sample.dict \
! 	dicts/hunspell_sample_long.affix dicts/hunspell_sample_long.dict \
! 	dicts/hunspell_sample_num.affix dicts/hunspell_sample_num.dict
  
  OBJS = ts_locale.o ts_parse.o wparser.o wparser_def.o dict.o \
  	dict_simple.o dict_synonym.o dict_thesaurus.o \
*** /dev/null
--- b/src/backend/tsearch/dicts/hunspell_sample.affix
***************
*** 0 ****
--- 1,24 ----
+ COMPOUNDFLAG Z
+ ONLYINCOMPOUND L
+ 
+ PFX B Y 1
+ PFX B   0	re	.
+ 
+ PFX U N 1
+ PFX U   0	un	.
+ 
+ SFX J Y 1
+ SFX J   0	INGS	[^E]
+ 
+ SFX G Y 1
+ SFX G   0	ING		[^E]
+ 
+ SFX S Y 1
+ SFX S   0	S	[^SXZHY]
+ 
+ SFX A Y 1
+ SFX A   Y	IES	[^AEIOU]Y
+ 
+ SFX \ N 1
+ SFX \   0	Y/L	[^Y]
+ 
*** /dev/null
--- b/src/backend/tsearch/dicts/hunspell_sample_long.affix
***************
*** 0 ****
--- 1,35 ----
+ FLAG long
+ 
+ AF 7
+ AF cZ		#1
+ AF cL		#2
+ AF sGsJpUsS	#3
+ AF sSpB		#4
+ AF cZsS		#5
+ AF sScZs\	#6
+ AF sA		#7
+ 
+ COMPOUNDFLAG cZ
+ ONLYINCOMPOUND cL
+ 
+ PFX pB Y 1
+ PFX pB   0	re	.
+ 
+ PFX pU N 1
+ PFX pU   0	un	.
+ 
+ SFX sJ Y 1
+ SFX sJ   0	INGS	[^E]
+ 
+ SFX sG Y 1
+ SFX sG   0	ING		[^E]
+ 
+ SFX sS Y 1
+ SFX sS   0	S	[^SXZHY]
+ 
+ SFX sA Y 1
+ SFX sA   Y	IES	[^AEIOU]Y
+ 
+ SFX s\ N 1
+ SFX s\   0	Y/2	[^Y]
+ 
*** /dev/null
--- b/src/backend/tsearch/dicts/hunspell_sample_long.dict
***************
*** 0 ****
--- 1,8 ----
+ book/3
+ booking/4
+ footballklubber
+ foot/5
+ football/1
+ ball/6
+ klubber/1
+ sky/7
*** /dev/null
--- b/src/backend/tsearch/dicts/hunspell_sample_num.affix
***************
*** 0 ****
--- 1,26 ----
+ FLAG num
+ 
+ COMPOUNDFLAG 101
+ ONLYINCOMPOUND 102
+ 
+ PFX 201 Y 1
+ PFX 201   0	re	.
+ 
+ PFX 202 N 1
+ PFX 202   0	un	.
+ 
+ SFX 301 Y 1
+ SFX 301   0	INGS	[^E]
+ 
+ SFX 302 Y 1
+ SFX 302   0	ING		[^E]
+ 
+ SFX 303 Y 1
+ SFX 303   0	S	[^SXZHY]
+ 
+ SFX 304 Y 1
+ SFX 304   Y	IES	[^AEIOU]Y
+ 
+ SFX 305 N 1
+ SFX 305   0	Y/102	[^Y]
+ 
*** /dev/null
--- b/src/backend/tsearch/dicts/hunspell_sample_num.dict
***************
*** 0 ****
--- 1,8 ----
+ book/302,301,202,303
+ booking/303,201
+ footballklubber
+ foot/101,303
+ football/101
+ ball/303,101,305
+ klubber/101
+ sky/304
*** /dev/null
--- b/src/backend/tsearch/dicts/ispell_sample.affix
***************
*** 0 ****
--- 1,26 ----
+ compoundwords controlled Z
+ 
+ prefixes
+ 
+ flag *B:
+ 	.       >   RE      # As in enter > reenter
+ 
+ flag U:
+     .       >   UN      # As in natural > unnatural
+ 
+ suffixes
+ 
+ flag *J:
+ 	[^E]    >   INGS        # As in cross > crossings
+ 
+ flag *G:
+ 	[^E]    >   ING     # As in cross > crossing
+ 
+ flag *S:
+ 	[^SXZHY]    >   S       # As in bat > bats
+ 
+ flag *A:
+ 	[^AEIOU]Y   >   -Y,IES      # As in imply > implies
+ 
+ flag ~\\:
+ 	[^Y]        >   Y              #~ advarsel > advarsely-
*** /dev/null
--- b/src/backend/tsearch/dicts/ispell_sample.dict
***************
*** 0 ****
--- 1,8 ----
+ book/GJUS
+ booking/SB
+ footballklubber
+ foot/ZS
+ football/Z
+ ball/SZ\
+ klubber/Z
+ sky/A
*** /dev/null
--- b/src/backend/tsearch/dicts/synonym_sample.syn
***************
*** 0 ****
--- 1,5 ----
+ postgres	pgsql
+ postgresql	pgsql
+ postgre	pgsql
+ gogle	googl
+ indices	index*
*** /dev/null
--- b/src/backend/tsearch/dicts/thesaurus_sample.ths
***************
*** 0 ****
--- 1,17 ----
+ #
+ # Theasurus config file. Character ':' separates string from replacement, eg
+ # sample-words : substitute-words
+ #
+ # Any substitute-word can be marked by preceding '*' character,
+ # which means do not lexize this word
+ # Docs: http://www.sai.msu.su/~megera/oddmuse/index.cgi/Thesaurus_dictionary
+ 
+ one two three : *123
+ one two : *12
+ one : *1
+ two : *2
+ 
+ supernovae stars : *sn
+ supernovae : *sn
+ booking tickets : order invitation cards
+ booking ? tickets : order invitation Cards
*** a/src/backend/tsearch/hunspell_sample.affix
--- /dev/null
***************
*** 1,24 ****
- COMPOUNDFLAG Z
- ONLYINCOMPOUND L
- 
- PFX B Y 1
- PFX B   0	re	.
- 
- PFX U N 1
- PFX U   0	un	.
- 
- SFX J Y 1
- SFX J   0	INGS	[^E]
- 
- SFX G Y 1
- SFX G   0	ING		[^E]
- 
- SFX S Y 1
- SFX S   0	S	[^SXZHY]
- 
- SFX A Y 1
- SFX A   Y	IES	[^AEIOU]Y
- 
- SFX \ N 1
- SFX \   0	Y/L	[^Y]
- 
--- 0 ----
*** a/src/backend/tsearch/ispell_sample.affix
--- /dev/null
***************
*** 1,26 ****
- compoundwords controlled Z
- 
- prefixes
- 
- flag *B:
- 	.       >   RE      # As in enter > reenter
- 
- flag U:
-     .       >   UN      # As in natural > unnatural
- 
- suffixes
- 
- flag *J:
- 	[^E]    >   INGS        # As in cross > crossings
- 
- flag *G:
- 	[^E]    >   ING     # As in cross > crossing
- 
- flag *S:
- 	[^SXZHY]    >   S       # As in bat > bats
- 
- flag *A:
- 	[^AEIOU]Y   >   -Y,IES      # As in imply > implies
- 
- flag ~\\:
- 	[^Y]        >   Y              #~ advarsel > advarsely-
--- 0 ----
*** a/src/backend/tsearch/ispell_sample.dict
--- /dev/null
***************
*** 1,8 ****
- book/GJUS
- booking/SB
- footballklubber
- foot/ZS
- football/Z
- ball/SZ\
- klubber/Z
- sky/A
--- 0 ----
*** a/src/backend/tsearch/spell.c
--- b/src/backend/tsearch/spell.c
***************
*** 5,10 ****
--- 5,58 ----
   *
   * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
   *
+  * Ispell dictionary
+  * -----------------
+  *
+  * Rules of dictionaries are defined in two files with .affix and .dict
+  * extensions. They are used by spell checker programs Ispell and Hunspell.
+  *
+  * An .affix file declares morphological rules to get a basic form of words.
+  * The format of an .affix file has different structure for Ispell and Hunspell
+  * dictionaries. The Hunspell format is more complicated. But when an .affix
+  * file is imported and compiled, it is stored in the same structure AffixNode.
+  *
+  * A .dict file stores a list of basic forms of words with references to
+  * affix rules. The format of a .dict file has the same structure for Ispell
+  * and Hunspell dictionaries.
+  *
+  * Compilation of a dictionary
+  * ---------------------------
+  *
+  * A compiled dictionary is stored in the IspellDict structure. Compilation of
+  * a dictionary is divided into the several steps:
+  *  - NIImportDictionary() - stores each word of a .dict file in the
+  *    temporary Spell field.
+  *  - NIImportAffixes() - stores affix rules of an .affix file in the
+  *    Affix field (not temporary) if an .affix file has the Ispell format.
+  *    -> NIImportOOAffixes() - stores affix rules if an .affix file has the
+  *       Hunspell format. The AffixData field is initialized if AF parameter
+  *       is defined.
+  *  - NISortDictionary() - builds a prefix tree (Trie) from the words list
+  *    and stores it in the Dictionary field. The words list is got from the
+  *    Spell field. The AffixData field is initialized if AF parameter is not
+  *    defined.
+  *  - NISortAffixes():
+  *    - builds a list of compond affixes from the affix list and stores it
+  *      in the CompoundAffix.
+  *    - builds prefix trees (Trie) from the affix list for prefixes and suffixes
+  *      and stores them in Suffix and Prefix fields.
+  *    The affix list is got from the Affix field.
+  *
+  * Memory management
+  * -----------------
+  *
+  * The IspellDict structure has the Spell field which is used only in compile
+  * time. The Spell field stores a words list. It can take a lot of memory.
+  * Therefore when a dictionary is compiled this field is cleared by
+  * NIFinishBuild().
+  *
+  * All resources which should cleared by NIFinishBuild() is initialized using
+  * tmpalloc() and tmpalloc0().
   *
   * IDENTIFICATION
   *	  src/backend/tsearch/spell.c
***************
*** 153,159 **** cmpspell(const void *s1, const void *s2)
  static int
  cmpspellaffix(const void *s1, const void *s2)
  {
! 	return (strncmp((*(SPELL *const *) s1)->p.flag, (*(SPELL *const *) s2)->p.flag, MAXFLAGLEN));
  }
  
  static char *
--- 201,208 ----
  static int
  cmpspellaffix(const void *s1, const void *s2)
  {
! 	return (strcmp((*(SPELL *const *) s1)->p.flag,
! 					(*(SPELL *const *) s2)->p.flag));
  }
  
  static char *
***************
*** 220,225 **** strbncmp(const unsigned char *s1, const unsigned char *s2, size_t count)
--- 269,279 ----
  	return 0;
  }
  
+ /*
+  * Compares affixes.
+  * First compares the type of an affix. Prefixes should go before affixes.
+  * If types are equal then compares replaceable string.
+  */
  static int
  cmpaffix(const void *s1, const void *s2)
  {
***************
*** 237,242 **** cmpaffix(const void *s1, const void *s2)
--- 291,425 ----
  					   (const unsigned char *) a2->repl);
  }
  
+ /*
+  * Gets an affix flag from string representation (a set of affixes).
+  *
+  * Several flags can be stored in a single string. Flags can be represented by:
+  * - 1 character (FM_CHAR).
+  * - 2 characters (FM_LONG).
+  * - numbers from 1 to 65000 (FM_NUM).
+  *
+  * Depending on the flagMode an affix string can have the following format:
+  * - FM_CHAR: ABCD
+  *   Here we have 4 flags: A, B, C and D
+  * - FM_LONG: ABCDE*
+  *   Here we have 3 flags: AB, CD and E*
+  * - FM_NUM: 200,205,50
+  *   Here we have 3 flags: 200, 205 and 50
+  *
+  * Conf: current dictionary.
+  * sflag: string representation (a set of affixes) of an affix flag.
+  * sflagnext: returns reference to the start of a next affix flag in the sflag.
+  *
+  * Returns an integer representation of the affix flag.
+  */
+ static unsigned short
+ DecodeFlag(IspellDict *Conf, char *sflag, char **sflagnext)
+ {
+ 	int64		s;
+ 	char	   *next;
+ 
+ 	switch (Conf->flagMode)
+ 	{
+ 		case FM_LONG:
+ 			s = (int)(((unsigned char *)sflag)[0]) << 8
+ 				| (int)(((unsigned char *)sflag)[1]);
+ 			if (sflagnext)
+ 				/* Go to start of the next flag */
+ 				*sflagnext = sflag + pg_mblen(sflag) * 2;
+ 			break;
+ 		case FM_NUM:
+ 			s = strtol(sflag, &next, 10);
+ 			if (s >= FLAGNUM_MAXSIZE)
+ 				ereport(ERROR,
+ 						(errcode(ERRCODE_CONFIG_FILE_ERROR),
+ 						 errmsg("invalid affix flag \"%s\"", sflag)));
+ 
+ 			if (sflagnext)
+ 			{
+ 				/* Go to start of the next flag */
+ 				if (next)
+ 				{
+ 					*sflagnext = next;
+ 					while (**sflagnext)
+ 					{
+ 						if (**sflagnext == ',')
+ 						{
+ 							/* Found start of the next flag */
+ 							*sflagnext += pg_mblen(*sflagnext);
+ 							break;
+ 						}
+ 						*sflagnext += pg_mblen(*sflagnext);
+ 					}
+ 				}
+ 				else
+ 					*sflagnext = 0;
+ 			}
+ 			break;
+ 		default:
+ 			s = (int64) *((unsigned char *)sflag);
+ 			if (s >= FLAGCHAR_MAXSIZE)
+ 				ereport(ERROR,
+ 						(errcode(ERRCODE_CONFIG_FILE_ERROR),
+ 						 errmsg("invalid affix flag \"%s\"", sflag)));
+ 
+ 			if (sflagnext)
+ 				/* Go to start of the next flag */
+ 				*sflagnext = sflag + pg_mblen(sflag);
+ 	}
+ 
+ 	return s;
+ }
+ 
+ /*
+  * Checks if the affix set Conf->AffixData[affix] contains affixflag.
+  * Conf->AffixData[affix] is the string representation of an affix flags.
+  * Conf->AffixData[affix] does not contain affixflag if this flag is not used
+  * actually by the .dict file.
+  *
+  * Conf: current dictionary.
+  * affix: index of the Conf->AffixData array.
+  * affixflag: integer representation of the affix flag.
+  *
+  * Returns true if the string Conf->AffixData[affix] contains affixflag,
+  * otherwise returns false.
+  */
+ static bool
+ IsAffixFlagInUse(IspellDict *Conf, int affix, unsigned short affixflag)
+ {
+ 	char *flagcur;
+ 	char *flagnext = 0;
+ 
+ 	if (affixflag == 0)
+ 		return true;
+ 
+ 	flagcur = Conf->AffixData[affix];
+ 
+ 	while (*flagcur)
+ 	{
+ 		/* Compare first affix flag in flagcur with affixflag */
+ 		if (DecodeFlag(Conf, flagcur, &flagnext) == affixflag)
+ 			return true;
+ 		/* Otherwise go to next flag */
+ 		if (flagnext)
+ 			flagcur = flagnext;
+ 		/* If we have not flags anymore then exit */
+ 		else
+ 			break;
+ 	}
+ 
+ 	/* Could not find affixflag */
+ 	return false;
+ }
+ 
+ /*
+  * Adds the new word into the temporary array Spell.
+  *
+  * Conf: current dictionary.
+  * word: new word.
+  * flag: set of affix flags. Integer representation of flag can be got by
+  *       DecodeFlag().
+  */
  static void
  NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
  {
***************
*** 255,268 **** NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
  	}
  	Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1);
  	strcpy(Conf->Spell[Conf->nspell]->word, word);
! 	strlcpy(Conf->Spell[Conf->nspell]->p.flag, flag, MAXFLAGLEN);
  	Conf->nspell++;
  }
  
  /*
!  * import dictionary
   *
!  * Note caller must already have applied get_tsearch_config_filename
   */
  void
  NIImportDictionary(IspellDict *Conf, const char *filename)
--- 438,455 ----
  	}
  	Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1);
  	strcpy(Conf->Spell[Conf->nspell]->word, word);
! 	Conf->Spell[Conf->nspell]->p.flag = (*flag != '\0')
! 		? cpstrdup(Conf, flag) : VoidString;
  	Conf->nspell++;
  }
  
  /*
!  * Imports dictionary into the temporary array Spell.
   *
!  * Note caller must already have applied get_tsearch_config_filename.
!  *
!  * Conf: current dictionary.
!  * filename: path to the .dict file.
   */
  void
  NIImportDictionary(IspellDict *Conf, const char *filename)
***************
*** 280,285 **** NIImportDictionary(IspellDict *Conf, const char *filename)
--- 467,473 ----
  	{
  		char	   *s,
  				   *pstr;
+ 		/* Set of affix flags */
  		const char *flag;
  
  		/* Extract flag from the line */
***************
*** 324,330 **** NIImportDictionary(IspellDict *Conf, const char *filename)
  	tsearch_readline_end(&trst);
  }
  
! 
  static int
  FindWord(IspellDict *Conf, const char *word, int affixflag, int flag)
  {
--- 512,541 ----
  	tsearch_readline_end(&trst);
  }
  
! /*
!  * Searches a basic form of word in the prefix tree. This word was generated
!  * using an affix rule. This rule may not be presented in an affix set of
!  * a basic form of word.
!  *
!  * For example, we have the entry in the .dict file:
!  * meter/GMD
!  *
!  * The affix rule with the flag S:
!  * SFX S   y     ies        [^aeiou]y
!  * is not presented here.
!  *
!  * The affix rule with the flag M:
!  * SFX M   0     's         .
!  * is presented here.
!  *
!  * Conf: current dictionary.
!  * word: basic form of word.
!  * affixflag: integer representation of the affix flag, by which a basic form of
!  *            word was generated.
!  * flag: compound flag used to compare with StopMiddle->compoundflag.
!  *
!  * Returns 1 if the word was found in the prefix tree, else returns 0.
!  */
  static int
  FindWord(IspellDict *Conf, const char *word, int affixflag, int flag)
  {
***************
*** 349,361 **** FindWord(IspellDict *Conf, const char *word, int affixflag, int flag)
  				{
  					if (flag == 0)
  					{
  						if (StopMiddle->compoundflag & FF_COMPOUNDONLY)
  							return 0;
  					}
  					else if ((flag & StopMiddle->compoundflag) == 0)
  						return 0;
  
! 					if ((affixflag == 0) || (strchr(Conf->AffixData[StopMiddle->affix], affixflag) != NULL))
  						return 1;
  				}
  				node = StopMiddle->node;
--- 560,581 ----
  				{
  					if (flag == 0)
  					{
+ 						/*
+ 						 * The word can be formed only with another word.
+ 						 * And in the flag parameter there is not a sign
+ 						 * that we search compound words.
+ 						 */
  						if (StopMiddle->compoundflag & FF_COMPOUNDONLY)
  							return 0;
  					}
  					else if ((flag & StopMiddle->compoundflag) == 0)
  						return 0;
  
! 					/*
! 					 * Check if this affix rule is presented in the affix set
! 					 * with index StopMiddle->affix.
! 					 */
! 					if (IsAffixFlagInUse(Conf, StopMiddle->affix, affixflag))
  						return 1;
  				}
  				node = StopMiddle->node;
***************
*** 373,378 **** FindWord(IspellDict *Conf, const char *word, int affixflag, int flag)
--- 593,616 ----
  	return 0;
  }
  
+ /*
+  * Adds a new affix rule to the Affix field.
+  *
+  * Conf: current dictionary.
+  * flag: integer representation of the affix flag ('\' in the below example).
+  * flagflags: set of flags from the flagval field for this affix rule. This set
+  *            is listed after '/' character in the added string (repl).
+  *
+  *            For example L flag in the hunspell_sample.affix:
+  *            SFX \   0	Y/L	[^Y]
+  *
+  * mask: condition for search ('[^Y]' in the above example).
+  * find: stripping characters from beginning (at prefix) or end (at suffix)
+  *       of the word ('0' in the above example, 0 means that there is not
+  *       stripping character).
+  * repl: adding string after stripping ('Y' in the above example).
+  * type: FF_SUFFIX or FF_PREFIX.
+  */
  static void
  NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const char *find, const char *repl, int type)
  {
***************
*** 394,411 **** NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c
  
  	Affix = Conf->Affix + Conf->naffixes;
  
! 	if (strcmp(mask, ".") == 0)
  	{
  		Affix->issimple = 1;
  		Affix->isregis = 0;
  	}
  	else if (RS_isRegis(mask))
  	{
  		Affix->issimple = 0;
  		Affix->isregis = 1;
! 		RS_compile(&(Affix->reg.regis), (type == FF_SUFFIX) ? true : false,
  				   *mask ? mask : VoidString);
  	}
  	else
  	{
  		int			masklen;
--- 632,652 ----
  
  	Affix = Conf->Affix + Conf->naffixes;
  
! 	/* This affix rule can be applied for words with any ending */
! 	if (strcmp(mask, ".") == 0 || *mask == '\0')
  	{
  		Affix->issimple = 1;
  		Affix->isregis = 0;
  	}
+ 	/* This affix rule will use regis to search word ending */
  	else if (RS_isRegis(mask))
  	{
  		Affix->issimple = 0;
  		Affix->isregis = 1;
! 		RS_compile(&(Affix->reg.regis), (type == FF_SUFFIX),
  				   *mask ? mask : VoidString);
  	}
+ 	/* This affix rule will use regex_t to search word ending */
  	else
  	{
  		int			masklen;
***************
*** 457,463 **** NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c
  	Conf->naffixes++;
  }
  
- 
  /* Parsing states for parse_affentry() and friends */
  #define PAE_WAIT_MASK	0
  #define PAE_INMASK		1
--- 698,703 ----
***************
*** 712,720 **** parse_affentry(char *str, char *mask, char *find, char *repl)
  
  	*pmask = *pfind = *prepl = '\0';
  
! 	return (*mask && (*find || *repl)) ? true : false;
  }
  
  static void
  addFlagValue(IspellDict *Conf, char *s, uint32 val)
  {
--- 952,967 ----
  
  	*pmask = *pfind = *prepl = '\0';
  
! 	return (*mask && (*find || *repl));
  }
  
+ /*
+  * Sets up a correspondence for the affix parameter with the affix flag.
+  *
+  * Conf: current dictionary.
+  * s: affix flag in string.
+  * val: affix parameter.
+  */
  static void
  addFlagValue(IspellDict *Conf, char *s, uint32 val)
  {
***************
*** 731,742 **** addFlagValue(IspellDict *Conf, char *s, uint32 val)
  				(errcode(ERRCODE_CONFIG_FILE_ERROR),
  				 errmsg("multibyte flag character is not allowed")));
  
! 	Conf->flagval[*(unsigned char *) s] = (unsigned char) val;
  	Conf->usecompound = true;
  }
  
  /*
!  * Import an affix file that follows MySpell or Hunspell format
   */
  static void
  NIImportOOAffixes(IspellDict *Conf, const char *filename)
--- 978,1043 ----
  				(errcode(ERRCODE_CONFIG_FILE_ERROR),
  				 errmsg("multibyte flag character is not allowed")));
  
! 	Conf->flagval[DecodeFlag(Conf, s, (char **)NULL)] = (unsigned char) val;
  	Conf->usecompound = true;
  }
  
  /*
!  * Returns a set of affix parameters which correspondence to the set of affix
!  * flags s.
!  */
! static int
! getFlagValues(IspellDict *Conf, char *s)
! {
! 	uint32	 flag = 0;
! 	char	*flagcur;
! 	char	*flagnext = 0;
! 
! 	flagcur = s;
! 	while (*flagcur)
! 	{
! 		flag |= Conf->flagval[DecodeFlag(Conf, flagcur, &flagnext)];
! 		if (flagnext)
! 			flagcur = flagnext;
! 		else
! 			break;
! 	}
! 
! 	return flag;
! }
! 
! /*
!  * Returns a flag set using the s parameter.
!  *
!  * If Conf->useFlagAliases is true then the s parameter is index of the
!  * Conf->AffixData array and function returns its entry.
!  * Else function returns the s parameter.
!  */
! static char *
! getFlags(IspellDict *Conf, char *s)
! {
! 	int curaffix;
! 	if (Conf->useFlagAliases)
! 	{
! 		curaffix = strtol(s, (char **)NULL, 10);
! 		if (curaffix && curaffix <= Conf->nAffixData)
! 			/*
! 			 * Do not substract 1 from curaffix
! 			 * because empty string was added in NIImportOOAffixes
! 			 */
! 			return Conf->AffixData[curaffix];
! 		else
! 			return VoidString;
! 	}
! 	else
! 		return s;
! }
! 
! /*
!  * Import an affix file that follows MySpell or Hunspell format.
!  *
!  * Conf: current dictionary.
!  * filename: path to the .affix file.
   */
  static void
  NIImportOOAffixes(IspellDict *Conf, const char *filename)
***************
*** 751,757 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
  	char		repl[BUFSIZ],
  			   *prepl;
  	bool		isSuffix = false;
! 	int			flag = 0;
  	char		flagflags = 0;
  	tsearch_readline_state trst;
  	char	   *recoded;
--- 1052,1061 ----
  	char		repl[BUFSIZ],
  			   *prepl;
  	bool		isSuffix = false;
! 	int			naffix = 0,
! 				curaffix = 0;
! 	int			flag = 0,
! 				sflaglen = 0;
  	char		flagflags = 0;
  	tsearch_readline_state trst;
  	char	   *recoded;
***************
*** 759,764 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
--- 1063,1070 ----
  	/* read file to find any flag */
  	memset(Conf->flagval, 0, sizeof(Conf->flagval));
  	Conf->usecompound = false;
+ 	Conf->useFlagAliases = false;
+ 	Conf->flagMode = FM_CHAR;
  
  	if (!tsearch_readline_begin(&trst, filename))
  		ereport(ERROR,
***************
*** 806,815 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
  			while (*s && t_isspace(s))
  				s += pg_mblen(s);
  
! 			if (*s && STRNCMP(s, "default") != 0)
! 				ereport(ERROR,
  						(errcode(ERRCODE_CONFIG_FILE_ERROR),
! 						 errmsg("Ispell dictionary supports only default flag value")));
  		}
  
  		pfree(recoded);
--- 1112,1129 ----
  			while (*s && t_isspace(s))
  				s += pg_mblen(s);
  
! 			if (*s)
! 			{
! 				if (STRNCMP(s, "long") == 0)
! 					Conf->flagMode = FM_LONG;
! 				else if (STRNCMP(s, "num") == 0)
! 					Conf->flagMode = FM_NUM;
! 				else if (STRNCMP(s, "default") != 0)
! 					ereport(ERROR,
  						(errcode(ERRCODE_CONFIG_FILE_ERROR),
! 						 errmsg("Ispell dictionary supports only default, "
! 						 		"long and num flag value")));
! 			}
  		}
  
  		pfree(recoded);
***************
*** 834,860 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
  		if (ptype)
  			pfree(ptype);
  		ptype = lowerstr_ctx(Conf, type);
  		if (fields_read < 4 ||
  			(STRNCMP(ptype, "sfx") != 0 && STRNCMP(ptype, "pfx") != 0))
  			goto nextline;
  
  		if (fields_read == 4)
  		{
! 			if (strlen(sflag) != 1)
! 				goto nextline;
! 			flag = *sflag;
! 			isSuffix = (STRNCMP(ptype, "sfx") == 0) ? true : false;
  			if (t_iseq(find, 'y') || t_iseq(find, 'Y'))
  				flagflags = FF_CROSSPRODUCT;
  			else
  				flagflags = 0;
  		}
  		else
  		{
  			char	   *ptr;
  			int			aflg = 0;
  
! 			if (strlen(sflag) != 1 || flag != *sflag || flag == 0)
  				goto nextline;
  			prepl = lowerstr_ctx(Conf, repl);
  			/* Find position of '/' in lowercased string "prepl" */
--- 1148,1224 ----
  		if (ptype)
  			pfree(ptype);
  		ptype = lowerstr_ctx(Conf, type);
+ 
+ 		/* First try to parse AF parameter (alias compression) */
+ 		if (STRNCMP(ptype, "af") == 0)
+ 		{
+ 			/* First line is the number of aliases */
+ 			if (!Conf->useFlagAliases)
+ 			{
+ 				Conf->useFlagAliases = true;
+ 				naffix = atoi(sflag);
+ 				if (naffix == 0)
+ 					ereport(ERROR,
+ 						(errcode(ERRCODE_CONFIG_FILE_ERROR),
+ 						 errmsg("invalid number of flag vector aliases")));
+ 
+ 				/* Also reserve place for empty flag set */
+ 				naffix++;
+ 
+ 				Conf->AffixData = (char **) palloc0(naffix * sizeof(char *));
+ 				Conf->lenAffixData = Conf->nAffixData = naffix;
+ 
+ 				/* Add empty flag set into AffixData */
+ 				Conf->AffixData[curaffix] = VoidString;
+ 				curaffix++;
+ 			}
+ 			/* Other lines is aliases */
+ 			else
+ 			{
+ 				if (curaffix < naffix)
+ 				{
+ 					Conf->AffixData[curaffix] = cpstrdup(Conf, sflag);
+ 					curaffix++;
+ 				}
+ 			}
+ 			goto nextline;
+ 		}
+ 		/* Else try to parse prefixes and suffixes */
  		if (fields_read < 4 ||
  			(STRNCMP(ptype, "sfx") != 0 && STRNCMP(ptype, "pfx") != 0))
  			goto nextline;
  
+ 		sflaglen = strlen(sflag);
+ 		if (sflaglen == 0
+ 			|| (sflaglen > 1 && Conf->flagMode == FM_CHAR)
+ 			|| (sflaglen > 2 && Conf->flagMode == FM_LONG))
+ 			goto nextline;
+ 
+ 		/*
+ 		 * Affix header. For example:
+ 		 * SFX \ N 1
+ 		 */
  		if (fields_read == 4)
  		{
! 			/* Convert the affix flag to int */
! 			flag = DecodeFlag(Conf, sflag, (char **)NULL);
! 
! 			isSuffix = (STRNCMP(ptype, "sfx") == 0);
  			if (t_iseq(find, 'y') || t_iseq(find, 'Y'))
  				flagflags = FF_CROSSPRODUCT;
  			else
  				flagflags = 0;
  		}
+ 		/*
+ 		 * Affix fields. For example:
+ 		 * SFX \   0	Y/L	[^Y]
+ 		 */
  		else
  		{
  			char	   *ptr;
  			int			aflg = 0;
  
! 			if (flag == 0)
  				goto nextline;
  			prepl = lowerstr_ctx(Conf, repl);
  			/* Find position of '/' in lowercased string "prepl" */
***************
*** 866,876 **** NIImportOOAffixes(IspellDict *Conf, const char *filename)
  				 */
  				*ptr = '\0';
  				ptr = repl + (ptr - prepl) + 1;
! 				while (*ptr)
! 				{
! 					aflg |= Conf->flagval[*(unsigned char *) ptr];
! 					ptr++;
! 				}
  			}
  			pfind = lowerstr_ctx(Conf, find);
  			pmask = lowerstr_ctx(Conf, mask);
--- 1230,1236 ----
  				 */
  				*ptr = '\0';
  				ptr = repl + (ptr - prepl) + 1;
! 				aflg |= getFlagValues(Conf, getFlags(Conf, ptr));
  			}
  			pfind = lowerstr_ctx(Conf, find);
  			pmask = lowerstr_ctx(Conf, mask);
***************
*** 928,933 **** NIImportAffixes(IspellDict *Conf, const char *filename)
--- 1288,1295 ----
  
  	memset(Conf->flagval, 0, sizeof(Conf->flagval));
  	Conf->usecompound = false;
+ 	Conf->useFlagAliases = false;
+ 	Conf->flagMode = FM_CHAR;
  
  	while ((recoded = tsearch_readline(&trst)) != NULL)
  	{
***************
*** 1044,1049 **** isnewformat:
--- 1406,1417 ----
  	NIImportOOAffixes(Conf, filename);
  }
  
+ /*
+  * Merges two affix flag sets and stores a new affix flag set into
+  * Conf->AffixData.
+  *
+  * Returns index of a new affix flag set.
+  */
  static int
  MergeAffix(IspellDict *Conf, int a1, int a2)
  {
***************
*** 1068,1088 **** MergeAffix(IspellDict *Conf, int a1, int a2)
  	return Conf->nAffixData - 1;
  }
  
  static uint32
  makeCompoundFlags(IspellDict *Conf, int affix)
  {
! 	uint32		flag = 0;
! 	char	   *str = Conf->AffixData[affix];
! 
! 	while (str && *str)
! 	{
! 		flag |= Conf->flagval[*(unsigned char *) str];
! 		str++;
! 	}
! 
! 	return (flag & FF_DICTFLAGMASK);
  }
  
  static SPNode *
  mkSPNode(IspellDict *Conf, int low, int high, int level)
  {
--- 1436,1460 ----
  	return Conf->nAffixData - 1;
  }
  
+ /*
+  * Returns a set of affix parameters which correspondence to the set of affix
+  * flags with the given index.
+  */
  static uint32
  makeCompoundFlags(IspellDict *Conf, int affix)
  {
! 	char *str = Conf->AffixData[affix];
! 	return (getFlagValues(Conf, str) & FF_DICTFLAGMASK);
  }
  
+ /*
+  * Makes a prefix tree for the given level.
+  *
+  * Conf: current dictionary.
+  * low: lower index of the Conf->Spell array.
+  * high: upper index of the Conf->Spell array.
+  * level: current prefix tree level.
+  */
  static SPNode *
  mkSPNode(IspellDict *Conf, int low, int high, int level)
  {
***************
*** 1115,1120 **** mkSPNode(IspellDict *Conf, int low, int high, int level)
--- 1487,1493 ----
  			{
  				if (lastchar)
  				{
+ 					/* Next level of the prefix tree */
  					data->node = mkSPNode(Conf, lownew, i, level + 1);
  					lownew = i;
  					data++;
***************
*** 1154,1159 **** mkSPNode(IspellDict *Conf, int low, int high, int level)
--- 1527,1533 ----
  			}
  		}
  
+ 	/* Next level of the prefix tree */
  	data->node = mkSPNode(Conf, lownew, high, level + 1);
  
  	return rs;
***************
*** 1172,1215 **** NISortDictionary(IspellDict *Conf)
  
  	/* compress affixes */
  
- 	/* Count the number of different flags used in the dictionary */
- 
- 	qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspellaffix);
- 
- 	naffix = 0;
- 	for (i = 0; i < Conf->nspell; i++)
- 	{
- 		if (i == 0 || strncmp(Conf->Spell[i]->p.flag, Conf->Spell[i - 1]->p.flag, MAXFLAGLEN))
- 			naffix++;
- 	}
- 
  	/*
! 	 * Fill in Conf->AffixData with the affixes that were used in the
! 	 * dictionary. Replace textual flag-field of Conf->Spell entries with
! 	 * indexes into Conf->AffixData array.
  	 */
! 	Conf->AffixData = (char **) palloc0(naffix * sizeof(char *));
! 
! 	curaffix = -1;
! 	for (i = 0; i < Conf->nspell; i++)
  	{
! 		if (i == 0 || strncmp(Conf->Spell[i]->p.flag, Conf->AffixData[curaffix], MAXFLAGLEN))
  		{
! 			curaffix++;
! 			Assert(curaffix < naffix);
! 			Conf->AffixData[curaffix] = cpstrdup(Conf, Conf->Spell[i]->p.flag);
  		}
- 
- 		Conf->Spell[i]->p.d.affix = curaffix;
- 		Conf->Spell[i]->p.d.len = strlen(Conf->Spell[i]->word);
  	}
  
! 	Conf->lenAffixData = Conf->nAffixData = naffix;
  
  	qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspell);
  	Conf->Dictionary = mkSPNode(Conf, 0, Conf->nspell, 0);
  }
  
  static AffixNode *
  mkANode(IspellDict *Conf, int low, int high, int level, int type)
  {
--- 1546,1628 ----
  
  	/* compress affixes */
  
  	/*
! 	 * If we use flag aliases then we need to use Conf->AffixData filled
! 	 * in the NIImportOOAffixes().
  	 */
! 	if (Conf->useFlagAliases)
  	{
! 		for (i = 0; i < Conf->nspell; i++)
  		{
! 			curaffix = strtol(Conf->Spell[i]->p.flag, (char **)NULL, 10);
! 			if (curaffix && curaffix <= Conf->nAffixData)
! 				Conf->Spell[i]->p.d.affix = curaffix;
! 			else
! 				/*
! 				 * If Conf->Spell[i]->p.flag is empty, then get empty value of
! 				 * Conf->AffixData (0 index).
! 				 */
! 				Conf->Spell[i]->p.d.affix = 0;
! 			Conf->Spell[i]->p.d.len = strlen(Conf->Spell[i]->word);
  		}
  	}
+ 	/* Otherwise fill Conf->AffixData here */
+ 	else
+ 	{
+ 		/* Count the number of different flags used in the dictionary */
+ 		qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *),
+ 				cmpspellaffix);
+ 
+ 		naffix = 0;
+ 		for (i = 0; i < Conf->nspell; i++)
+ 		{
+ 			if (i == 0
+ 				|| strcmp(Conf->Spell[i]->p.flag, Conf->Spell[i - 1]->p.flag))
+ 				naffix++;
+ 		}
  
! 		/*
! 		 * Fill in Conf->AffixData with the affixes that were used in the
! 		 * dictionary. Replace textual flag-field of Conf->Spell entries with
! 		 * indexes into Conf->AffixData array.
! 		 */
! 		Conf->AffixData = (char **) palloc0(naffix * sizeof(char *));
! 
! 		curaffix = -1;
! 		for (i = 0; i < Conf->nspell; i++)
! 		{
! 			if (i == 0
! 				|| strcmp(Conf->Spell[i]->p.flag, Conf->AffixData[curaffix]))
! 			{
! 				curaffix++;
! 				Assert(curaffix < naffix);
! 				Conf->AffixData[curaffix] = cpstrdup(Conf,
! 													Conf->Spell[i]->p.flag);
! 			}
! 
! 			Conf->Spell[i]->p.d.affix = curaffix;
! 			Conf->Spell[i]->p.d.len = strlen(Conf->Spell[i]->word);
! 		}
! 
! 		Conf->lenAffixData = Conf->nAffixData = naffix;
! 	}
  
+ 	/* Start build a prefix tree */
  	qsort((void *) Conf->Spell, Conf->nspell, sizeof(SPELL *), cmpspell);
  	Conf->Dictionary = mkSPNode(Conf, 0, Conf->nspell, 0);
  }
  
+ /*
+  * Makes a prefix tree for the given level using the repl string of an affix
+  * rule. Affixes with empty replace string do not include in the prefix tree.
+  * This affixes are included by mkVoidAffix().
+  *
+  * Conf: current dictionary.
+  * low: lower index of the Conf->Affix array.
+  * high: upper index of the Conf->Affix array.
+  * level: current prefix tree level.
+  * type: FF_SUFFIX or FF_PREFIX.
+  */
  static AffixNode *
  mkANode(IspellDict *Conf, int low, int high, int level, int type)
  {
***************
*** 1247,1252 **** mkANode(IspellDict *Conf, int low, int high, int level, int type)
--- 1660,1666 ----
  			{
  				if (lastchar)
  				{
+ 					/* Next level of the prefix tree */
  					data->node = mkANode(Conf, lownew, i, level + 1, type);
  					if (naff)
  					{
***************
*** 1267,1272 **** mkANode(IspellDict *Conf, int low, int high, int level, int type)
--- 1681,1687 ----
  			}
  		}
  
+ 	/* Next level of the prefix tree */
  	data->node = mkANode(Conf, lownew, high, level + 1, type);
  	if (naff)
  	{
***************
*** 1281,1286 **** mkANode(IspellDict *Conf, int low, int high, int level, int type)
--- 1696,1705 ----
  	return rs;
  }
  
+ /*
+  * Makes the root void node in the prefix tree. The root void node is created
+  * for affixes which have empty replace string ("repl" field).
+  */
  static void
  mkVoidAffix(IspellDict *Conf, bool issuffix, int startsuffix)
  {
***************
*** 1304,1314 **** mkVoidAffix(IspellDict *Conf, bool issuffix, int startsuffix)
  		Conf->Prefix = Affix;
  	}
  
! 
  	for (i = start; i < end; i++)
  		if (Conf->Affix[i].replen == 0)
  			cnt++;
  
  	if (cnt == 0)
  		return;
  
--- 1723,1734 ----
  		Conf->Prefix = Affix;
  	}
  
! 	/* Count affixes with empty replace string */
  	for (i = start; i < end; i++)
  		if (Conf->Affix[i].replen == 0)
  			cnt++;
  
+ 	/* There is not affixes with empty replace string */
  	if (cnt == 0)
  		return;
  
***************
*** 1324,1341 **** mkVoidAffix(IspellDict *Conf, bool issuffix, int startsuffix)
  		}
  }
  
  static bool
! isAffixInUse(IspellDict *Conf, char flag)
  {
  	int			i;
  
  	for (i = 0; i < Conf->nAffixData; i++)
! 		if (strchr(Conf->AffixData[i], flag) != NULL)
  			return true;
  
  	return false;
  }
  
  void
  NISortAffixes(IspellDict *Conf)
  {
--- 1744,1774 ----
  		}
  }
  
+ /*
+  * Checks if the affixflag is used by dictionary. Conf->AffixData does not
+  * contain affixflag if this flag is not used actually by the .dict file.
+  *
+  * Conf: current dictionary.
+  * affixflag: integer representation of the affix flag.
+  *
+  * Returns true if the Conf->AffixData array contains affixflag, otherwise
+  * returns false.
+  */
  static bool
! isAffixInUse(IspellDict *Conf, unsigned short affixflag)
  {
  	int			i;
  
  	for (i = 0; i < Conf->nAffixData; i++)
! 		if (IsAffixFlagInUse(Conf, i, affixflag))
  			return true;
  
  	return false;
  }
  
+ /*
+  * Builds Conf->Prefix and Conf->Suffix trees from the imported affixes.
+  */
  void
  NISortAffixes(IspellDict *Conf)
  {
***************
*** 1347,1352 **** NISortAffixes(IspellDict *Conf)
--- 1780,1786 ----
  	if (Conf->naffixes == 0)
  		return;
  
+ 	/* Store compound affixes in the Conf->CompoundAffix array */
  	if (Conf->naffixes > 1)
  		qsort((void *) Conf->Affix, Conf->naffixes, sizeof(AFFIX), cmpaffix);
  	Conf->CompoundAffix = ptr = (CMPDAffix *) palloc(sizeof(CMPDAffix) * Conf->naffixes);
***************
*** 1359,1365 **** NISortAffixes(IspellDict *Conf)
  			firstsuffix = i;
  
  		if ((Affix->flagflags & FF_COMPOUNDFLAG) && Affix->replen > 0 &&
! 			isAffixInUse(Conf, (char) Affix->flag))
  		{
  			if (ptr == Conf->CompoundAffix ||
  				ptr->issuffix != (ptr - 1)->issuffix ||
--- 1793,1799 ----
  			firstsuffix = i;
  
  		if ((Affix->flagflags & FF_COMPOUNDFLAG) && Affix->replen > 0 &&
! 			isAffixInUse(Conf, Affix->flag))
  		{
  			if (ptr == Conf->CompoundAffix ||
  				ptr->issuffix != (ptr - 1)->issuffix ||
***************
*** 1370,1376 **** NISortAffixes(IspellDict *Conf)
  				/* leave only unique and minimals suffixes */
  				ptr->affix = Affix->repl;
  				ptr->len = Affix->replen;
! 				ptr->issuffix = (Affix->type == FF_SUFFIX) ? true : false;
  				ptr++;
  			}
  		}
--- 1804,1810 ----
  				/* leave only unique and minimals suffixes */
  				ptr->affix = Affix->repl;
  				ptr->len = Affix->replen;
! 				ptr->issuffix = (Affix->type == FF_SUFFIX);
  				ptr++;
  			}
  		}
***************
*** 1378,1383 **** NISortAffixes(IspellDict *Conf)
--- 1812,1818 ----
  	ptr->affix = NULL;
  	Conf->CompoundAffix = (CMPDAffix *) repalloc(Conf->CompoundAffix, sizeof(CMPDAffix) * (ptr - Conf->CompoundAffix + 1));
  
+ 	/* Start build a prefix tree */
  	Conf->Prefix = mkANode(Conf, 0, firstsuffix, 0, FF_PREFIX);
  	Conf->Suffix = mkANode(Conf, firstsuffix, Conf->naffixes, 0, FF_SUFFIX);
  	mkVoidAffix(Conf, true, firstsuffix);
***************
*** 1825,1831 **** SplitToVariants(IspellDict *Conf, SPNode *snode, SplitVar *orig, char *word, int
  
  		if (StopLow < StopHigh)
  		{
! 			if (level == FF_COMPOUNDBEGIN)
  				compoundflag = FF_COMPOUNDBEGIN;
  			else if (level == wordlen - 1)
  				compoundflag = FF_COMPOUNDLAST;
--- 2260,2266 ----
  
  		if (StopLow < StopHigh)
  		{
! 			if (startpos == 0)
  				compoundflag = FF_COMPOUNDBEGIN;
  			else if (level == wordlen - 1)
  				compoundflag = FF_COMPOUNDLAST;
*** a/src/backend/tsearch/synonym_sample.syn
--- /dev/null
***************
*** 1,5 ****
- postgres	pgsql
- postgresql	pgsql
- postgre	pgsql
- gogle	googl
- indices	index*
--- 0 ----
*** a/src/backend/tsearch/thesaurus_sample.ths
--- /dev/null
***************
*** 1,17 ****
- #
- # Theasurus config file. Character ':' separates string from replacement, eg
- # sample-words : substitute-words
- #
- # Any substitute-word can be marked by preceding '*' character,
- # which means do not lexize this word
- # Docs: http://www.sai.msu.su/~megera/oddmuse/index.cgi/Thesaurus_dictionary
- 
- one two three : *123
- one two : *12
- one : *1
- two : *2
- 
- supernovae stars : *sn
- supernovae : *sn
- booking tickets : order invitation cards
- booking ? tickets : order invitation Cards
--- 0 ----
*** a/src/include/tsearch/dicts/spell.h
--- b/src/include/tsearch/dicts/spell.h
***************
*** 19,36 ****
  #include "tsearch/ts_public.h"
  
  /*
!  * Max length of a flag name. Names longer than this will be truncated
!  * to the maximum.
   */
- #define MAXFLAGLEN 16
- 
  struct SPNode;
  
  typedef struct
  {
  	uint32		val:8,
  				isword:1,
  				compoundflag:4,
  				affix:19;
  	struct SPNode *node;
  } SPNodeData;
--- 19,36 ----
  #include "tsearch/ts_public.h"
  
  /*
!  * SPNode and SPNodeData are used to represent prefix tree (Trie) to store
!  * a words list.
   */
  struct SPNode;
  
  typedef struct
  {
  	uint32		val:8,
  				isword:1,
+ 				/* Stores compound flags listed below */
  				compoundflag:4,
+ 				/* Reference to an entry of the AffixData field */
  				affix:19;
  	struct SPNode *node;
  } SPNodeData;
***************
*** 43,49 **** typedef struct
  #define FF_COMPOUNDBEGIN	0x02
  #define FF_COMPOUNDMIDDLE	0x04
  #define FF_COMPOUNDLAST		0x08
! #define FF_COMPOUNDFLAG		( FF_COMPOUNDBEGIN | FF_COMPOUNDMIDDLE | FF_COMPOUNDLAST )
  #define FF_DICTFLAGMASK		0x0f
  
  typedef struct SPNode
--- 43,50 ----
  #define FF_COMPOUNDBEGIN	0x02
  #define FF_COMPOUNDMIDDLE	0x04
  #define FF_COMPOUNDLAST		0x08
! #define FF_COMPOUNDFLAG		( FF_COMPOUNDBEGIN | FF_COMPOUNDMIDDLE | \
! 							FF_COMPOUNDLAST )
  #define FF_DICTFLAGMASK		0x0f
  
  typedef struct SPNode
***************
*** 54,72 **** typedef struct SPNode
  
  #define SPNHDRSZ	(offsetof(SPNode,data))
  
! 
  typedef struct spell_struct
  {
  	union
  	{
  		/*
! 		 * flag is filled in by NIImportDictionary. After NISortDictionary, d
! 		 * is valid and flag is invalid.
  		 */
! 		char		flag[MAXFLAGLEN];
  		struct
  		{
  			int			affix;
  			int			len;
  		}			d;
  	}			p;
--- 55,78 ----
  
  #define SPNHDRSZ	(offsetof(SPNode,data))
  
! /*
!  * Represents an entry in a words list.
!  */
  typedef struct spell_struct
  {
  	union
  	{
  		/*
! 		 * flag is filled in by NIImportDictionary(). After NISortDictionary(),
! 		 * d is used instead of flag.
  		 */
! 		char	   *flag;
! 		/* d is used in mkSPNode() */
  		struct
  		{
+ 			/* Reference to an entry of the AffixData field */
  			int			affix;
+ 			/* Length of the word */
  			int			len;
  		}			d;
  	}			p;
***************
*** 75,84 **** typedef struct spell_struct
  
  #define SPELLHDRSZ	(offsetof(SPELL, word))
  
  typedef struct aff_struct
  {
! 	uint32		flag:8,
! 				type:1,
  				flagflags:7,
  				issimple:1,
  				isregis:1,
--- 81,94 ----
  
  #define SPELLHDRSZ	(offsetof(SPELL, word))
  
+ /*
+  * Represents an entry in an affix list.
+  */
  typedef struct aff_struct
  {
! 	uint32		flag:16;
! 				/* FF_SUFFIX or FF_PREFIX */
! 	uint32		type:1,
  				flagflags:7,
  				issimple:1,
  				isregis:1,
***************
*** 106,111 **** typedef struct aff_struct
--- 116,125 ----
  #define FF_SUFFIX				1
  #define FF_PREFIX				0
  
+ /*
+  * AffixNode and AffixNodeData are used to represent prefix tree (Trie) to store
+  * an affix list.
+  */
  struct AffixNode;
  
  typedef struct
***************
*** 132,137 **** typedef struct
--- 146,161 ----
  	bool		issuffix;
  } CMPDAffix;
  
+ typedef enum
+ {
+ 	FM_CHAR,
+ 	FM_LONG,
+ 	FM_NUM
+ } FlagMode;
+ 
+ #define FLAGCHAR_MAXSIZE	(1 << 8)
+ #define FLAGNUM_MAXSIZE		(1 << 16)
+ 
  typedef struct
  {
  	int			maffixes;
***************
*** 142,155 **** typedef struct
  	AffixNode  *Prefix;
  
  	SPNode	   *Dictionary;
  	char	  **AffixData;
  	int			lenAffixData;
  	int			nAffixData;
  
  	CMPDAffix  *CompoundAffix;
  
! 	unsigned char flagval[256];
  	bool		usecompound;
  
  	/*
  	 * Remaining fields are only used during dictionary construction; they are
--- 166,182 ----
  	AffixNode  *Prefix;
  
  	SPNode	   *Dictionary;
+ 	/* Array of sets of affixes */
  	char	  **AffixData;
  	int			lenAffixData;
  	int			nAffixData;
+ 	bool		useFlagAliases;
  
  	CMPDAffix  *CompoundAffix;
  
! 	unsigned char flagval[FLAGNUM_MAXSIZE];
  	bool		usecompound;
+ 	FlagMode	flagMode;
  
  	/*
  	 * Remaining fields are only used during dictionary construction; they are
*** a/src/test/regress/expected/tsdicts.out
--- b/src/test/regress/expected/tsdicts.out
***************
*** 191,196 **** SELECT ts_lexize('hunspell', 'footballyklubber');
--- 191,388 ----
   {foot,ball,klubber}
  (1 row)
  
+ -- Test ISpell dictionary with hunspell affix file with FLAG long parameter
+ CREATE TEXT SEARCH DICTIONARY hunspell_long (
+                         Template=ispell,
+                         DictFile=hunspell_sample_long,
+                         AffFile=hunspell_sample_long
+ );
+ SELECT ts_lexize('hunspell_long', 'skies');
+  ts_lexize 
+ -----------
+  {sky}
+ (1 row)
+ 
+ SELECT ts_lexize('hunspell_long', 'bookings');
+    ts_lexize    
+ ----------------
+  {booking,book}
+ (1 row)
+ 
+ SELECT ts_lexize('hunspell_long', 'booking');
+    ts_lexize    
+ ----------------
+  {booking,book}
+ (1 row)
+ 
+ SELECT ts_lexize('hunspell_long', 'foot');
+  ts_lexize 
+ -----------
+  {foot}
+ (1 row)
+ 
+ SELECT ts_lexize('hunspell_long', 'foots');
+  ts_lexize 
+ -----------
+  {foot}
+ (1 row)
+ 
+ SELECT ts_lexize('hunspell_long', 'rebookings');
+    ts_lexize    
+ ----------------
+  {booking,book}
+ (1 row)
+ 
+ SELECT ts_lexize('hunspell_long', 'rebooking');
+    ts_lexize    
+ ----------------
+  {booking,book}
+ (1 row)
+ 
+ SELECT ts_lexize('hunspell_long', 'rebook');
+  ts_lexize 
+ -----------
+  
+ (1 row)
+ 
+ SELECT ts_lexize('hunspell_long', 'unbookings');
+  ts_lexize 
+ -----------
+  {book}
+ (1 row)
+ 
+ SELECT ts_lexize('hunspell_long', 'unbooking');
+  ts_lexize 
+ -----------
+  {book}
+ (1 row)
+ 
+ SELECT ts_lexize('hunspell_long', 'unbook');
+  ts_lexize 
+ -----------
+  {book}
+ (1 row)
+ 
+ SELECT ts_lexize('hunspell_long', 'footklubber');
+    ts_lexize    
+ ----------------
+  {foot,klubber}
+ (1 row)
+ 
+ SELECT ts_lexize('hunspell_long', 'footballklubber');
+                       ts_lexize                       
+ ------------------------------------------------------
+  {footballklubber,foot,ball,klubber,football,klubber}
+ (1 row)
+ 
+ SELECT ts_lexize('hunspell_long', 'ballyklubber');
+    ts_lexize    
+ ----------------
+  {ball,klubber}
+ (1 row)
+ 
+ SELECT ts_lexize('hunspell_long', 'footballyklubber');
+       ts_lexize      
+ ---------------------
+  {foot,ball,klubber}
+ (1 row)
+ 
+ -- Test ISpell dictionary with hunspell affix file with FLAG num parameter
+ CREATE TEXT SEARCH DICTIONARY hunspell_num (
+                         Template=ispell,
+                         DictFile=hunspell_sample_num,
+                         AffFile=hunspell_sample_num
+ );
+ SELECT ts_lexize('hunspell_num', 'skies');
+  ts_lexize 
+ -----------
+  {sky}
+ (1 row)
+ 
+ SELECT ts_lexize('hunspell_num', 'bookings');
+    ts_lexize    
+ ----------------
+  {booking,book}
+ (1 row)
+ 
+ SELECT ts_lexize('hunspell_num', 'booking');
+    ts_lexize    
+ ----------------
+  {booking,book}
+ (1 row)
+ 
+ SELECT ts_lexize('hunspell_num', 'foot');
+  ts_lexize 
+ -----------
+  {foot}
+ (1 row)
+ 
+ SELECT ts_lexize('hunspell_num', 'foots');
+  ts_lexize 
+ -----------
+  {foot}
+ (1 row)
+ 
+ SELECT ts_lexize('hunspell_num', 'rebookings');
+    ts_lexize    
+ ----------------
+  {booking,book}
+ (1 row)
+ 
+ SELECT ts_lexize('hunspell_num', 'rebooking');
+    ts_lexize    
+ ----------------
+  {booking,book}
+ (1 row)
+ 
+ SELECT ts_lexize('hunspell_num', 'rebook');
+  ts_lexize 
+ -----------
+  
+ (1 row)
+ 
+ SELECT ts_lexize('hunspell_num', 'unbookings');
+  ts_lexize 
+ -----------
+  {book}
+ (1 row)
+ 
+ SELECT ts_lexize('hunspell_num', 'unbooking');
+  ts_lexize 
+ -----------
+  {book}
+ (1 row)
+ 
+ SELECT ts_lexize('hunspell_num', 'unbook');
+  ts_lexize 
+ -----------
+  {book}
+ (1 row)
+ 
+ SELECT ts_lexize('hunspell_num', 'footklubber');
+    ts_lexize    
+ ----------------
+  {foot,klubber}
+ (1 row)
+ 
+ SELECT ts_lexize('hunspell_num', 'footballklubber');
+                       ts_lexize                       
+ ------------------------------------------------------
+  {footballklubber,foot,ball,klubber,football,klubber}
+ (1 row)
+ 
+ SELECT ts_lexize('hunspell_num', 'ballyklubber');
+    ts_lexize    
+ ----------------
+  {ball,klubber}
+ (1 row)
+ 
+ SELECT ts_lexize('hunspell_num', 'footballyklubber');
+       ts_lexize      
+ ---------------------
+  {foot,ball,klubber}
+ (1 row)
+ 
  -- Synonim dictionary
  CREATE TEXT SEARCH DICTIONARY synonym (
  						Template=synonym,
***************
*** 277,282 **** SELECT to_tsquery('hunspell_tst', 'footballyklubber:b & rebookings:A & sky');
--- 469,516 ----
   'foot':B & 'ball':B & 'klubber':B & ( 'booking':A | 'book':A ) & 'sky'
  (1 row)
  
+ -- Test ispell dictionary with hunspell affix with FLAG long in configuration
+ ALTER TEXT SEARCH CONFIGURATION hunspell_tst ALTER MAPPING
+ 	REPLACE hunspell WITH hunspell_long;
+ SELECT to_tsvector('hunspell_tst', 'Booking the skies after rebookings for footballklubber from a foot');
+                                             to_tsvector                                             
+ ----------------------------------------------------------------------------------------------------
+  'ball':7 'book':1,5 'booking':1,5 'foot':7,10 'football':7 'footballklubber':7 'klubber':7 'sky':3
+ (1 row)
+ 
+ SELECT to_tsquery('hunspell_tst', 'footballklubber');
+                                   to_tsquery                                  
+ ------------------------------------------------------------------------------
+  ( 'footballklubber' | 'foot' & 'ball' & 'klubber' ) | 'football' & 'klubber'
+ (1 row)
+ 
+ SELECT to_tsquery('hunspell_tst', 'footballyklubber:b & rebookings:A & sky');
+                                to_tsquery                               
+ ------------------------------------------------------------------------
+  'foot':B & 'ball':B & 'klubber':B & ( 'booking':A | 'book':A ) & 'sky'
+ (1 row)
+ 
+ -- Test ispell dictionary with hunspell affix with FLAG num in configuration
+ ALTER TEXT SEARCH CONFIGURATION hunspell_tst ALTER MAPPING
+ 	REPLACE hunspell_long WITH hunspell_num;
+ SELECT to_tsvector('hunspell_tst', 'Booking the skies after rebookings for footballklubber from a foot');
+                                             to_tsvector                                             
+ ----------------------------------------------------------------------------------------------------
+  'ball':7 'book':1,5 'booking':1,5 'foot':7,10 'football':7 'footballklubber':7 'klubber':7 'sky':3
+ (1 row)
+ 
+ SELECT to_tsquery('hunspell_tst', 'footballklubber');
+                                   to_tsquery                                  
+ ------------------------------------------------------------------------------
+  ( 'footballklubber' | 'foot' & 'ball' & 'klubber' ) | 'football' & 'klubber'
+ (1 row)
+ 
+ SELECT to_tsquery('hunspell_tst', 'footballyklubber:b & rebookings:A & sky');
+                                to_tsquery                               
+ ------------------------------------------------------------------------
+  'foot':B & 'ball':B & 'klubber':B & ( 'booking':A | 'book':A ) & 'sky'
+ (1 row)
+ 
  -- Test synonym dictionary in configuration
  CREATE TEXT SEARCH CONFIGURATION synonym_tst (
  						COPY=english
*** a/src/test/regress/sql/tsdicts.sql
--- b/src/test/regress/sql/tsdicts.sql
***************
*** 48,53 **** SELECT ts_lexize('hunspell', 'footballklubber');
--- 48,101 ----
  SELECT ts_lexize('hunspell', 'ballyklubber');
  SELECT ts_lexize('hunspell', 'footballyklubber');
  
+ -- Test ISpell dictionary with hunspell affix file with FLAG long parameter
+ CREATE TEXT SEARCH DICTIONARY hunspell_long (
+                         Template=ispell,
+                         DictFile=hunspell_sample_long,
+                         AffFile=hunspell_sample_long
+ );
+ 
+ SELECT ts_lexize('hunspell_long', 'skies');
+ SELECT ts_lexize('hunspell_long', 'bookings');
+ SELECT ts_lexize('hunspell_long', 'booking');
+ SELECT ts_lexize('hunspell_long', 'foot');
+ SELECT ts_lexize('hunspell_long', 'foots');
+ SELECT ts_lexize('hunspell_long', 'rebookings');
+ SELECT ts_lexize('hunspell_long', 'rebooking');
+ SELECT ts_lexize('hunspell_long', 'rebook');
+ SELECT ts_lexize('hunspell_long', 'unbookings');
+ SELECT ts_lexize('hunspell_long', 'unbooking');
+ SELECT ts_lexize('hunspell_long', 'unbook');
+ 
+ SELECT ts_lexize('hunspell_long', 'footklubber');
+ SELECT ts_lexize('hunspell_long', 'footballklubber');
+ SELECT ts_lexize('hunspell_long', 'ballyklubber');
+ SELECT ts_lexize('hunspell_long', 'footballyklubber');
+ 
+ -- Test ISpell dictionary with hunspell affix file with FLAG num parameter
+ CREATE TEXT SEARCH DICTIONARY hunspell_num (
+                         Template=ispell,
+                         DictFile=hunspell_sample_num,
+                         AffFile=hunspell_sample_num
+ );
+ 
+ SELECT ts_lexize('hunspell_num', 'skies');
+ SELECT ts_lexize('hunspell_num', 'bookings');
+ SELECT ts_lexize('hunspell_num', 'booking');
+ SELECT ts_lexize('hunspell_num', 'foot');
+ SELECT ts_lexize('hunspell_num', 'foots');
+ SELECT ts_lexize('hunspell_num', 'rebookings');
+ SELECT ts_lexize('hunspell_num', 'rebooking');
+ SELECT ts_lexize('hunspell_num', 'rebook');
+ SELECT ts_lexize('hunspell_num', 'unbookings');
+ SELECT ts_lexize('hunspell_num', 'unbooking');
+ SELECT ts_lexize('hunspell_num', 'unbook');
+ 
+ SELECT ts_lexize('hunspell_num', 'footklubber');
+ SELECT ts_lexize('hunspell_num', 'footballklubber');
+ SELECT ts_lexize('hunspell_num', 'ballyklubber');
+ SELECT ts_lexize('hunspell_num', 'footballyklubber');
+ 
  -- Synonim dictionary
  CREATE TEXT SEARCH DICTIONARY synonym (
  						Template=synonym,
***************
*** 94,99 **** SELECT to_tsvector('hunspell_tst', 'Booking the skies after rebookings for footb
--- 142,163 ----
  SELECT to_tsquery('hunspell_tst', 'footballklubber');
  SELECT to_tsquery('hunspell_tst', 'footballyklubber:b & rebookings:A & sky');
  
+ -- Test ispell dictionary with hunspell affix with FLAG long in configuration
+ ALTER TEXT SEARCH CONFIGURATION hunspell_tst ALTER MAPPING
+ 	REPLACE hunspell WITH hunspell_long;
+ 
+ SELECT to_tsvector('hunspell_tst', 'Booking the skies after rebookings for footballklubber from a foot');
+ SELECT to_tsquery('hunspell_tst', 'footballklubber');
+ SELECT to_tsquery('hunspell_tst', 'footballyklubber:b & rebookings:A & sky');
+ 
+ -- Test ispell dictionary with hunspell affix with FLAG num in configuration
+ ALTER TEXT SEARCH CONFIGURATION hunspell_tst ALTER MAPPING
+ 	REPLACE hunspell_long WITH hunspell_num;
+ 
+ SELECT to_tsvector('hunspell_tst', 'Booking the skies after rebookings for footballklubber from a foot');
+ SELECT to_tsquery('hunspell_tst', 'footballklubber');
+ SELECT to_tsquery('hunspell_tst', 'footballyklubber:b & rebookings:A & sky');
+ 
  -- Test synonym dictionary in configuration
  CREATE TEXT SEARCH CONFIGURATION synonym_tst (
  						COPY=english

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

Reply via email to