William wrote:
> 
> Thanks for the reply.
> 
>> Have you also modified the index.noun file to account for your changes?
> 
>> index.noun contains a list of byte offsets into data.noun, and any changes to
>> the latter mean the former is invalid.
> 
> I have modified index.noun too.
> 
>> Alternatively, I wonder what platform you are working on? Records in the
>> WordNet files must be terminated by just a single "\x0A". If you are working
>> on a non-Unix platform that uses a multi-character record separator then the
>> records will be a different length, so invalidating the index file.
> 
> I am working on Linux william-pc 2.6.24-16-generic #1 SMP Thu Apr 10 13:23:42 
> UTC 2008 i686 GNU/Linux
> 
> OK,
> I have to admit something. Only today, after learning about the seek
> function, did I realize how the synset ID is actually determined: it is
> the byte offset that you mentioned. Before this I thought the synset ID
> was some kind of database auto-increment ID or primary key.
> lol.
> 
> Now I realize that when I add, say, 3 characters to the first line
> and then call seek(FH, 00001930, 0),
> I get
> g)\n00001930 03 n 01 physical_entity 0 007 @ 00001740 n 0000 ~ 00002452 n
> 0000 ~ 00002684 n 0000 ~ 00007347 n 0000 ~ 00020827 n 0000 ~ 00029677 n
> 0000 ~ 14580597 n 0000 | an entity that has physical existence
> 
> 00001740 03 n 02 entity 0 003 ~ 00001930 n 0000 ~ 00002137 n 0000 ~ 04424418 
> n 0000 | that which is perceived or known or inferred to have its own 
> distinct existence (living or nonliving)
> 00001930 03 n 01 physical_entity 0 007 @ 00001740 n 0000 ~ 00002452 n 0000 ~ 
> 00002684 n 0000 ~ 00007347 n 0000 ~ 00020827 n 0000 ~ 00029677 n 0000 ~ 
> 14580597 n 0000 | an entity that has physical existence
> 
> No wonder it's invalid.
> 
> I wonder why the database is arranged this way. Is it to make lookups
> faster? And what is the index.noun file used for when all the information
> in it is also in data.noun?
> 
> So how can I add new synonyms to the WordNet database without
> affecting the original byte offsets?

You clearly haven't come across file indexing before! Using seek() to jump
straight to a record is far faster than reading through the file until you
find the data you need.
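
As a rough illustration of the difference, here is a minimal sketch of reading
one record by offset. The helper name read_record_at and the toy data are my
own inventions, and an in-memory string stands in for data.noun. Note that an
offset read from the file is a string such as "00001930", which Perl numifies
as decimal 1930; a bare literal with leading zeros would be parsed as octal.

```perl
use strict;
use warnings;

# Return the record that starts at byte $offset, without its newline.
# (read_record_at is a hypothetical helper, not part of any WordNet module.)
sub read_record_at {
    my ($fh, $offset) = @_;
    # Whence 0 means "relative to the start of the file" (SEEK_SET).
    # Adding 0 turns a string such as "00000013" into the number 13.
    seek($fh, $offset + 0, 0) or die "seek failed: $!";
    my $line = <$fh>;
    return undef unless defined $line;
    chomp $line;
    return $line;
}

# Toy stand-in for data.noun: made-up records, not real WordNet data.
my @records = ('first record', 'second record', 'third record');
my $data    = join '', map { "$_\n" } @records;
open my $fh, '<', \$data or die "open failed: $!";

# The second record starts right after the first one and its "\n".
my $offset = sprintf '%08d', length("first record\n");
print read_record_at($fh, $offset), "\n";
```

However large the file is, the cost is one seek and one line read, rather than
a scan from the top.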

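Using the file offset as a record ID is a good idea because

- It is bound to be unique: no two records can start at the same byte
- It is easy to verify that the data hasn't been corrupted: each record begins
  with its own offset, so a seek that lands anywhere else shows up at once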
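
The verification point can be sketched as follows. This assumes, as in the
records you quoted, that every record begins with its own offset as eight
decimal digits; the function name offset_is_valid and the toy records are
made up for the example.

```perl
use strict;
use warnings;

# Return 1 if the record at byte $offset begins with that same offset,
# otherwise 0. (offset_is_valid is a hypothetical helper.)
sub offset_is_valid {
    my ($fh, $offset) = @_;
    seek($fh, $offset, 0) or return 0;
    my $line = <$fh>;
    return 0 unless defined $line;
    return $line =~ /^(\d{8}) / && $1 == $offset ? 1 : 0;
}

# Build a toy file in which every record starts with its own offset.
my $data = '';
for my $gloss ('toy record one', 'toy record two') {
    $data .= sprintf("%08d 03 n 01 toy 0 000 | %s\n", length($data), $gloss);
}
my $second = index($data, "\n") + 1;    # offset of the second record

open my $fh, '<', \$data or die "open failed: $!";
print "intact: ", offset_is_valid($fh, $second), "\n";     # intact: 1

# Add three characters to the first record, as in your experiment.
(my $edited = $data) =~ s/toy record one/toy record one!!!/;
open my $fh2, '<', \$edited or die "open failed: $!";
print "edited: ", offset_is_valid($fh2, $second), "\n";    # edited: 0
```

After the edit the old offset lands inside the tail of the first record, which
is exactly the "g)\n..." effect you saw.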

The separate index.noun file is there to make it quick to find all records in
data.noun that apply to a given word.
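
A minimal sketch of that lookup, assuming the index line layout given in the
WordNet documentation (lemma, part of speech, synset count, pointer symbols,
sense counts, then the synset offsets at the end of the line). The function
name offsets_for is hypothetical, and the sample lines are illustrative: only
the two offsets come from the records quoted above.

```perl
use strict;
use warnings;

# Return the synset offsets listed for $lemma in the index text.
# (offsets_for is a hypothetical helper; the field positions are an
# assumption taken from the documented index file layout.)
sub offsets_for {
    my ($index_text, $lemma) = @_;
    for my $line (split /\n/, $index_text) {
        next if $line =~ /^\s/;            # header lines start with spaces
        my @f = split ' ', $line;
        next unless @f && $f[0] eq $lemma;
        my $synset_cnt = $f[2];            # third field: number of synsets
        return @f[-$synset_cnt .. -1];     # offsets are the last fields
    }
    return;
}

# Illustrative stand-in for index.noun.
my $index = join "\n",
    '  1 toy header line, skipped because it starts with a space',
    'entity n 1 1 ~ 1 0 00001740',
    'physical_entity n 1 2 @ ~ 1 0 00001930',
    '';

print join(' ', offsets_for($index, 'physical_entity')), "\n";   # 00001930
```

This sketch scans the lines in order to stay short. As far as I know the real
index files are sorted by lemma, so a binary search would also work. Each
offset returned can then be passed to seek() on data.noun.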

Editing the database is a non-trivial task. You've found the documentation
already, so take a look at that and write something that allows you to move data
around while keeping the record IDs valid.
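
As a starting point only, here is a sketch of one way to renumber a data file
after editing it. It relies on every offset being exactly eight digits, so
replacing an old offset with a new one never changes a record's length and the
new offsets can be computed in a single pass. The function name renumber and
the toy records are my own; the sketch also assumes plain ASCII data, a gloss
introduced by " | ", and pointers written as offset followed by the
part-of-speech letter. Pointers into other files (verbs and so on) are left
alone, and index.noun would need the same old-to-new map applied to it.

```perl
use strict;
use warnings;

# Take the edited records of one data file, each still carrying its OLD
# offset as the first field, and return the renumbered records together
# with the old-to-new offset map. (renumber is a hypothetical helper.)
sub renumber {
    my ($pos, @records) = @_;              # $pos is 'n' for data.noun
    my %new_for;
    my $position = 0;
    for my $rec (@records) {
        my ($old) = $rec =~ /^(\d{8}) / or die "not a record: $rec";
        $new_for{$old} = sprintf '%08d', $position;
        $position += length($rec) + 1;     # +1 for the "\n" terminator
    }
    my @out;
    for my $rec (@records) {
        # Never touch the gloss: it may contain digits of its own.
        my ($fields, $gloss) = split / \| /, $rec, 2;
        $fields =~ s/^(\d{8})/$new_for{$1}/;
        # Remap only pointers into this same file, e.g. "00001740 n".
        $fields =~ s/(?<= )(\d{8})(?= \Q$pos\E )/
            exists $new_for{$1} ? $new_for{$1} : $1/gex;
        push @out, defined $gloss ? "$fields | $gloss" : $fields;
    }
    return (\@out, \%new_for);
}

# Toy records: a synonym ("thing") has been added to the first one, so the
# old offset of the second record no longer matches its position.
my @edited = (
    '00000000 03 n 02 entity 0 thing 0 001 ~ 00000040 n 0000 | toy gloss',
    '00000040 03 n 01 physical_entity 0 001 @ 00000000 n 0000 | toy gloss',
);
my ($out, $map) = renumber('n', @edited);
print "$_\n" for @$out;
```

The header lines at the top of the real files would have to be counted into
the starting position as well, which this sketch does not do.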

Rob

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/

