On 5/12/09, Salahuddin Pasha <salahuddi...@gmail.com> wrote:
> Dear all,
>
> I was working on অভিধান - Abhidhan for XML support, to enable various applications and tools to utilize our dictionary.
>
> Basic work is already done, but we need to define a standard XML format (XML DTD or XML Schema).
>
> Any suggestions or comments ?

Back in 2003, the bengalinux dictionary list had a discussion on this. Nothing ever came out of it, and when Golam first started on anubadok, his emphasis was more specialized. In any case, that discussion may provide some suggestions. You can get it from the list archives, and I'm also attaching a cleaned up and edited version of the thread here:

<thread from May 2003>

----
[Ankur-dictionary] dictionary.dtd
From: Kaushik Ghose <kgh...@wa...> - 2003-05-14 04:17

Hi,
here is the descriptor file.
I'm new to XML and DTDs so please go over the semantics as well as the syntax and see if this serves our purpose...

<?xml version="1.0"?>
<!ELEMENT entry*(word_bn, info_bn*)>
<!ELEMENT word_bn (#CDATA)>
<!ELEMENT info_bn (english, pronounciation_bn,meaning_bn)>
<!ELEMENT english (#CDATA)>
<!ELEMENT pronounciation_bn (#CDATA)>
<!ELEMENT meaning_bn (#CDATA)>

thanks
-kg

----
From: Kaushik Ghose <kgh...@wa...> - 2003-05-14 05:12

Ok, small correction, QTs DOM class seems to parse this correctly

dictionary.dtd

<?xml version="1.0"?>
<!ELEMENT dictionary (entry*)>
<!ELEMENT entry (word_bn, info_bn*) >
<!ELEMENT word_bn (#CDATA)>
<!ELEMENT info_bn (english?, pronounciation_bn?,meaning_bn?)>
<!ELEMENT english (#CDATA)>
<!ELEMENT pronounciation_bn (#CDATA)>
<!ELEMENT meaning_bn (#CDATA)>

test.xml

<?xml version="1.0"?>
<!DOCTYPE entry SYSTEM "dictionary.dtd">
<dictionary>
  <entry>
    <word_bn>????????????????????? ???????????????</word_bn>
    <info_bn>
      <english>seedling</english>
      <pronounciation_bn>ankur</pronounciation_bn>
      <meaning_bn>??????????????????? ??????????? ???????????????????????? ?????????????????? ??????????????????</meaning_bn>
    </info_bn>
  </entry>
  <entry>
    <word_bn>????????????????????? ?????????</word_bn>
    <info_bn>
      <english>bangla</english>
      <pronounciation_bn>bangla</pronounciation_bn>
      <meaning_bn>??????????????????? ????????????????? ????????????????????????, ????????????????????????? ??????????? ?????????????????????????
?????</meaning_bn>
    </info_bn>
    <info_bn>
      <english>bengali</english>
    </info_bn>
  </entry>
</dictionary>

thanks
-kg

----
From: Deepayan Sarkar <deepa...@st...> - 2003-05-14 07:03

Ha! A friend of mine once corrected me on this, now I can correct someone else :) 'pronounciation' should be spelled 'pronunciation'.

I'm not an expert on DTDs (though I know someone who knows much more, whom I can ask after we make some progress). I find it very difficult to understand DTD's, and much easier to understand examples of what the final thing would look like. Let's work that way, and we can write out the DTD once we decide on the 'look'.

I don't know if you know this, but there's something called attributes which might be useful. For instance, with multiple meanings as different parts of speech. Here's an example (I'm using slightly different tags) --- 'pos' is part of speech, 'plural' is whether the word has a plural form, etc.:

<entry>
  <word>chhaanaa</word>
  <info pos="noun" plural="false" origin="deshi">
    <meaning>dudh theke toiri ek dhoroner ...</meaning>
    <synonyms>...</synonyms>
    <antonyms>...</antonyms> ## ???
    <translation lang="en">cottage cheese (?)</translation>
    <pronunciation>chhaanaa</pronunciation>
  </info>
  <info pos="noun" origin="tatbhabo"> #it's probably not, but...
    <meaning>shishu, bachchaa</meaning>
    <translation lang="en">child, young</translation> # comma separated
    <translation lang="hn">bachcha</translation> #hindi is hn ? not sure
    <pronunciation>chhaanaa</pronunciation>
    <derivative form="the">chhaanaaTaa, chhaanaaTi</derivative>
    <derivative form="of" num="singular">chhaanaaTir</derivative>
    <derivative form="of" num="plural">chhaanaader</derivative>
  </info>
</entry>

(I've used romanized bengali in place of what should be bengali, but you get the idea.)

I think we should handle derivative words here (and not have separate entries for them. They can be generated from this). Sanskrit has very systematic rules for 'shabdarup'.
Bengali isn't as systematic, but there are still quite general rules. We can formulate some rules and list down only derivative words that are exceptions to that rule. We have the standard forms:

to, by, for, from, of and in

plus maybe plurals, the, a --- anything else ?

Also, Bengali (unlike English) often has many words which mean exactly the same thing. We might try to think of a way to have a single entry for all of them.

Can anyone (preferably with a dictionary at hand) think of anything else ?

This is not very important right now, but what's a good format to store pronunciation ?

----
From: Taneem Ahmed <tan...@ey...> - 2003-05-14 08:33

On Wed, 14 May 2003, Kaushik Ghose wrote:
> Hi,
> here is the descriptor file.
> I'm new to XML and DTDs so please go over the semantics as well as the syntax and see if this serves our purpose...
>
> <?xml version="1.0"?>
> <!ELEMENT entry*(word_bn, info_bn*)>
> <!ELEMENT word_bn (#CDATA)>
> <!ELEMENT info_bn (english, pronounciation_bn,meaning_bn)>
> <!ELEMENT english (#CDATA)>
> <!ELEMENT pronounciation_bn (#CDATA)>
> <!ELEMENT meaning_bn (#CDATA)>

I remember someone mentioned something about multiple language support. Is it possible to have a general element instead of "english" so that it'll be easier to expand for other languages?

Taneem

----
From: Taneem Ahmed <tan...@ey...> - 2003-05-14 08:37

Sorry I didn't see Deepayan's mail when I sent my previous e-mail. His example is what I was talking about :)

Taneem

On Wed, 14 May 2003, Deepayan Sarkar wrote:

----
From: Kaushik Ghose <kgh...@wa...> - 2003-05-14 20:54

hi,

On Wed, 14 May 2003, Deepayan Sarkar wrote:
>
> Ha! A friend of mine once corrected me on this, now I can correct someone else :) 'pronounciation' should be spelled 'pronunciation'.
>

Okay :), so the new tag for this is <pron> >:D

> I'm not an expert on DTDs (though I know someone who knows much more, whom I can ask after we make some progress).
> I find it very difficult to understand DTD's, and much easier to understand examples of what the final thing would look like. Let's work that way, and we can write out the DTD once we decide on the 'look'.

Sure, I think I've got the hold of elementary DTD (ie of the level I set out, so I can handle that -QTs happy, so am I...)

> I don't know if you know this, but there's something called attributes which might be useful. For instance, with multiple meanings as different parts of speech. Here's an example (I'm using slightly different tags) --- 'pos' is part of speech, 'plural' is whether the word has a plural form, etc.:
>
> <entry>
>   <word>chhaanaa</word>
>   <info pos="noun" plural="false" origin="deshi">
>     <meaning>dudh theke toiri ek dhoroner ...</meaning>
>     <synonyms>...</synonyms>
>     <antonyms>...</antonyms> ## ???
>     <translation lang="en">cottage cheese (?)</translation>
>     <pronunciation>chhaanaa</pronunciation>
>   </info>
>   <info pos="noun" origin="tatbhabo"> #it's probably not, but...
>     <meaning>shishu, bachchaa</meaning>
>     <translation lang="en">child, young</translation> # comma separated
>     <translation lang="hn">bachcha</translation> #hindi is hn ? not sure
>     <pronunciation>chhaanaa</pronunciation>
>     <derivative form="the">chhaanaaTaa, chhaanaaTi</derivative>
>     <derivative form="of" num="singular">chhaanaaTir</derivative>
>     <derivative form="of" num="plural">chhaanaader</derivative>
>   </info>
> </entry>

I would suggest only putting in the english synonym, or closest word - this is a question of size and interfacing. If we have a set of english synonyms we can then use that to link to an English-German dict say, or an English-Thai dict to have a bangla-thai dict for ex.

If we start to put in translations for additional languages I think the file will become very large and slow to load.
As it is, with the bangla word, bangla synonyms, antonyms, meanings and english synonyms I think we are going to deal with pretty large files for each bangla alphabet.

Another issue to deal with is what we do with words that have no direct one word english equivalent.

I couldn't get what "origin" means ?
By plural="false" do you mean it doesn't have a plural form ?

> I think we should handle derivative words here (and not have separate entries for them. They can be generated from this). Sanskrit has very systematic rules for 'shabdarup'. Bengali isn't as systematic, but there are still quite general rules. We can formulate some rules and list down only derivative words that are exceptions to that rule. We have the standard forms:
>
> to, by, for, from, of and in
>
> plus maybe plurals, the, a --- anything else ?

This is fine,

> Also, Bengali (unlike English) often has many words which mean exactly the same thing. We might try to think of a way to have a single entry for all of them.

I would rather not. I'd say link it to the required word by putting that in the synonym, and in the <meaning> tag put in something like "see blah"

> Can anyone (preferably with a dictionary at hand) think of anything else ?
>
> This is not very important right now, but what's a good format to store pronunciation ?

unicode should do fine, there's a provision for the international phonetic alphabet
http://www.unicode.org/charts/PDF/U0250.pdf

so the next draft layout...

<dictionary>
  <entry>
    <word_bn> chanaa </word_bn>
    <info pos="noun" plural="true" origin="??">
      <pron>....</pron>
      <meaning_bn> baccha </meaning_bn>
      <synonym_bn>...</synonym_bn>
      <synonym_bn>...</synonym_bn>
      <antonym_bn>...</antonym_bn>
      <synonym_en>...</synonym_en>
      <synonym_en>...</synonym_en>
      <grammar>
        <derivative form="the">chhaanaaTaa,chhaanaaTi</derivative>
        <derivative form="of" num="singular">chhaanaaTir</derivative>
        <derivative form="of" num="plural">chhaanaader</derivative>
      </grammar>
    </info>
    <info pos="noun" plural="false" origin="??">
      <pron>...</pron>
      <meaning_bn> khabar... </meaning_bn>
    </info>
  </entry>
</dictionary>

-kg

----
From: Deepayan Sarkar <deepa...@st...> - 2003-05-14 23:25

On Wednesday 14 May 2003 15:53, Kaushik Ghose wrote:
> I would suggest only putting in the english synonym, or closest word - this is a question of size and interfacing. If we have a set of english synonyms we can then use that to link to an English-German dict say, or an English-Thai dict to have a bangla-thai dict for ex.
> If we start to put in translations for additional languages I think the file will become very large and slow to load.

Before we go any further, we need to decide how we are eventually planning to use the XML files.

I don't think XML is a good format for use in any real application. For example, for a spell-checker to load the XML files directly would be very inefficient.

Instead, the XML could be a repository of all possible information we might ever want to have. For a spell checker we could generate something that would contain only the words and nothing else (that could be a plain text file, or a database, could be in various different encodings and formats). Generating this from the XML may take a while, but if we do this once every two months or so, it shouldn't matter. Similarly for speech synthesis, we could extract only the actual word and its pronunciation, and leave everything else out.

From that perspective, I don't think it should matter if the XML files become large. And of course we don't need to have a single file for each alphabet, we could split them as much as we want (maybe the first 3 letters identify each file) as long as given a word it's possible to identify which file that word belongs to.

As for the translation, I'm not saying that we have to list translations into all possible languages. But there's no harm in keeping the option. In fact, initially we won't even have english translations for the words that we already have. And as you point out, not all words will even have an English translation. All this wouldn't matter if we allow an arbitrary number (including 0) of instances of the <translation> tag for each word.

The English->other language idea may not always be the best because there might be some words which have no proper english version, but could have, say, hindi versions. We could make it policy to include a non-english translation only when this is the case. But explicitly ruling out that option is not a good idea, I think.

> As it is, with the bangla word, bangla synonyms, antonyms, meanings and english synonyms I think we are going to deal with pretty large files for each bangla alphabet.
>
> Another issue to deal with is what we do with words that have no direct one word english equivalent.
>
> I couldn't get what "origin" means ?

Basically tot-somo, tot-bhobo, dishi, bideshi, that sort of stuff.

> By plural="false" do you mean it doesn't have a plural form ?

Yes.

> > I think we should handle derivative words here (and not have separate entries for them. They can be generated from this). Sanskrit has very systematic rules for 'shabdarup'. Bengali isn't as systematic, but there are still quite general rules. We can formulate some rules and list down only derivative words that are exceptions to that rule.
> > We have the standard forms:
> >
> > to, by, for, from, of and in
> >
> > plus maybe plurals, the, a --- anything else ?
>
> This is fine,
>
> > Also, Bengali (unlike English) often has many words which mean exactly the same thing. We might try to think of a way to have a single entry for all of them.
>
> I would rather not. I'd say link it to the required word by putting that in the synonym, and in the <meaning> tag put in something like "see blah"

Yes, that should be good enough. Maybe in those cases

<word_bn>gabAkSha</word_bn>
<info ...>
  <meaning_bn type="refer">jAnalA</meaning_bn>
</info>

> > Can anyone (preferably with a dictionary at hand) think of anything else ?
> >
> > This is not very important right now, but what's a good format to store pronunciation ?
>
> unicode should do fine, there's a provision for the international phonetic alphabet
> http://www.unicode.org/charts/PDF/U0250.pdf

Cool. Does there exist a speech synthesizer which can work from this ? That way we could confirm that we enter the correct pronunciation.

> so the next draft layout...
>
> <dictionary>
>   <entry>
>     <word_bn> chanaa </word_bn>
>     <info pos="noun" plural="true" origin="??">

Since most words would have plural="true", we could omit that (the default would be "true").

>       <pron>....</pron>
>       <meaning_bn> baccha </meaning_bn>
>       <synonym_bn>...</synonym_bn>
>       <synonym_bn>...</synonym_bn>

Any problem with giving multiple synonyms comma separated ?

>       <antonym_bn>...</antonym_bn>
>       <synonym_en>...</synonym_en>
>       <synonym_en>...</synonym_en>

I still think a translation tag with a language attribute would be more appropriate.

>       <grammar>
>         <derivative form="the">chhaanaaTaa,chhaanaaTi</derivative>
>         <derivative form="of" num="singular">chhaanaaTir</derivative>
>         <derivative form="of" num="plural">chhaanaader</derivative>
>       </grammar>
>     </info>
>     <info pos="noun" plural="false" origin="??">
>       <pron>...</pron>
>       <meaning_bn> khabar...
>       </meaning_bn>
>     </info>
>   </entry>
> </dictionary>

Otherwise looks OK (maybe an optional comment tag for each word), unless someone else can think of something.

BTW, what's the use of the extra _bn for the tags (not that it matters) ?

Deepayan

----
From: Kaushik Ghose <kgh...@wa...> - 2003-05-15 02:57

Hiya,

On Wed, 14 May 2003, Deepayan Sarkar wrote:
> Before we go any further, we need to decide how we are eventually planning to use the XML files.
>
> I don't think XML is a good format for use in any real application. For example, for a spell-checker to load the XML files directly would be very inefficient.
>
> Instead, the XML could be a repository of all possible information we might ever want to have. For a spell checker we could generate something that would contain only the words and nothing else (that could be a plain text file, or a database, could be in various different encodings and formats). Generating this from the XML may take a while, but if we do this once every two months or so, it shouldn't matter. Similarly for speech synthesis, we could extract only the actual word and its pronunciation, and leave everything else out.
>
> From that perspective, I don't think it should matter if the XML files become large. And of course we don't need to have a single file for each alphabet, we could split them as much as we want (maybe the first 3 letters identify each file) as long as given a word it's possible to identify which file that word belongs to.
>
> As for the translation, I'm not saying that we have to list translations into all possible languages. But there's no harm in keeping the option. In fact, initially we won't even have english translations for the words that we already have. And as you point out, not all words will even have an English translation. All this wouldn't matter if we allow an arbitrary number (including 0) of instances of the <translation> tag for each word.
>

Ok, that seems fine.
The size of the files will matter for the GUI that does the dicto editing and any online collaboration tool we come up with for creating the dicto, but yes, we'll have automated tools to create (like you, may be on the first of every two months) separate file clusters for spell checkers, thesauri etc. which can be more compacted.

Now, for the translation. Are we looking to put in one word that can link this bangla word to a word in some other dicto ? Or are we looking to give a translation of it ? For that we can probably end up with two sets of tags.

<synonym lang ="">...</synonym>
<meaning lang ="">...</meaning>

where synonym is the one word thingy, meaning is well a paragraph or so.

> Yes, that should be good enough. Maybe in those cases
>
> <word_bn>gabAkSha</word_bn>
> <info ...>
>   <meaning_bn type="refer">jAnalA</meaning_bn>
> </info>

Yes, good idea, I'd prefer a separate tag <refer> which would do this job. we could do it via synonyms too, may be everything...

> Cool. Does there exist a speech synthesizer which can work from this ? That way we could confirm that we enter the correct pronunciation.

Didn't go much through it but here's a promising site
http://www.vorde.org/prodVordeTech/documents/vorde/split/node28.html

> > so the next draft layout...
> >
> > <dictionary>
> >   <entry>
> >     <word_bn> chanaa </word_bn>
> >     <info pos="noun" plural="true" origin="??">
>
> Since most words would have plural="true", we could omit that (the default would be "true").
>
> >       <pron>....</pron>
> >       <meaning_bn> baccha </meaning_bn>
> >       <synonym_bn>...</synonym_bn>
> >       <synonym_bn>...</synonym_bn>
>
> Any problem with giving multiple synonyms comma separated ?
>
> >       <antonym_bn>...</antonym_bn>
> >       <synonym_en>...</synonym_en>
> >       <synonym_en>...</synonym_en>

Yeah, I couldn't figure out if commas would tell the parser these are separate instances, or just one big glob of text, so I played it safe...

> I still think a translation tag with a language attribute would be more appropriate.

Yes.

> >       <grammar>
> >         <derivative form="the">chhaanaaTaa,chhaanaaTi</derivative>
> >         <derivative form="of" num="singular">chhaanaaTir</derivative>
> >         <derivative form="of" num="plural">chhaanaader</derivative>
> >       </grammar>
> >     </info>
> >     <info pos="noun" plural="false" origin="??">
> >       <pron>...</pron>
> >       <meaning_bn> khabar... </meaning_bn>
> >     </info>
> >   </entry>
> > </dictionary>
>
> Otherwise looks OK (maybe an optional comment tag for each word), unless someone else can think of something.
>
> BTW, what's the use of the extra _bn for the tags (not that it matters) ?

Yeah, that should get replaced by the lang tag.

so here it is (hopefully I remembered everything)

<dictionary>
  <entry>
    <word>...</word>
    <info pos="noun" plural="false" orign="." date=".">
      <pron>...</pron>
      <synonym lang="bn">...</synonym>
      <synonym lang="bn">...</synonym>
      <antonym lang="bn">...</antonym>
      <synonym lang="en">...</synonym>
      <meaning lang="bn">...</meaning>
      <meaning lang="en">...</meaning>
      <grammar>
        <derivative form="the" num="singular">...</derivative>
      </grammar>
    </info>
  </entry>
</dictionary>

I'll make a DTD and see if I can make a GUI for it...

-kg

----
From: Deepayan Sarkar <deepa...@st...> - 2003-05-15 04:13

On Wednesday 14 May 2003 21:56, Kaushik Ghose wrote:
> Ok, that seems fine. The size of the files will matter for the GUI that does the dicto editing and any online collaboration tool we come up with for creating the dicto, but yes, we'll have automated tools to create (like you, may be on the first of every two months) separate file clusters for spell checkers, thesauri etc. which can be more compacted.

Yes, we do need to plan ahead so that individual files don't get very big. Since the main purpose of the GUI is to enter new words and edit existing words, the only requirement is that given a word we should be able to figure out which file it should be in.
That way, if the file doesn't exist, the program could create a blank instance of the XML document object, and if it does exist, parse it and read it into memory.

As for the file structure, we could consider a separate directory for each starting character, then one file for each combination of first 3 letters (I'm not sure what the best way to name these files would be). But we may need to adjust this depending on how many files per directory and how many words per file this would make. Could you run through the existing words and get an estimate (basically count combinations of first 3 characters) ?

> Now, for the translation. Are we looking to put in one word that can link this bangla word to a word in some other dicto ? Or are we looking to give a translation of it ? For that we can probably end up with two sets of tags.
>
> <synonym lang ="">...</synonym>
> <meaning lang ="">...</meaning>
>
> where synonym is the one word thingy, meaning is well a paragraph or so.

Again, no harm in keeping the option (that way, we could potentially have a bengali to english dictionary as well as a bengali to bengali).

> > Yes, that should be good enough. Maybe in those cases
> >
> > <word_bn>gabAkSha</word_bn>
> > <info ...>
> >   <meaning_bn type="refer">jAnalA</meaning_bn>
> > </info>
>
> Yes, good idea, I'd prefer a separate tag <refer> which would do this job. we could do it via synonyms too, may be everything...

OK.

> > Any problem with giving multiple synonyms comma separated ?
> >
> > > <antonym_bn>...</antonym_bn>
> > > <synonym_en>...</synonym_en>
> > > <synonym_en>...</synonym_en>
>
> Yeah, I couldn't figure out if commas would tell the parser these are separate instances, or just one big glob of text, so I played it safe...

The comma is not special in XML, so it would be interpreted as a single long string. But we could always interpret them correctly inside applications. Anyway, it's not that important.

> so here it is (hopefully I remembered everything)
>
> <dictionary>
>   <entry>
>     <word>...</word>
>     <info pos="noun" plural="false" orign="." date=".">

What's date ? The last modification time ?

>       <pron>...</pron>
>       <synonym lang="bn">...</synonym>
>       <synonym lang="bn">...</synonym>
>       <antonym lang="bn">...</antonym>
>       <synonym lang="en">...</synonym>
>       <meaning lang="bn">...</meaning>
>       <meaning lang="en">...</meaning>
>       <grammar>
>         <derivative form="the" num="singular">...</derivative>
>       </grammar>
>     </info>
>   </entry>
> </dictionary>
>
> I'll make a DTD and see if I can make a GUI for it...

Great. I have done this sort of programming in Python, but not C++. I might be able to help once you get something going.

I think it might be useful to start by writing a class to represent a single XML file, with methods to add and modify tags (rather than directly accessing the XML document object all the time). That way, if there are minor changes in the DTD, we just need to modify this class.

Deepayan

----
From: Kaushik Ghose <kgh...@wa...> - 2003-05-16 15:07

<?xml version="1.0"?>
<!ELEMENT dictionary (entry*)>
<!ELEMENT entry (word, info*) >
<!ELEMENT word (#CDATA)>
<!ELEMENT info (refer?,pron?, synonym?,antonym?,meaning?,grammar?)>
<!ATTLIST info
  pos (n|adj|v|adv) "n"
  plural (true|false) "false"
  origin CDATA #DEFAULT "????????????"
  date CDATA>
<!ELEMENT refer (#CDATA)>
<!ELEMENT pron (#CDATA)>
<!ELEMENT synonym (#CDATA)>
<!ATTLIST synonym lang CDATA #DEFAULT "bn">
<!ELEMENT antonym (#CDATA)>
<!ATTLIST antonym lang CDATA #DEFAULT "bn">
<!ELEMENT meaning (#CDATA)>
<!ATTLIST meaning lang CDATA #DEFAULT "bn">
<!ELEMENT grammar (derivative?)>
<!ELEMENT derivative (#CDATA)>
<!ATTLIST derivative
  form (the|of) "the"
  num (singular|plural) "singular">

also, to answer Deepayan's question by date I was thinking of date of origin, first use etc.

Will potter with QT right now, I'm going to hardcode the DTD structure, I can't think of a simple way of creating an editor that will parse the DTD and configure the GUI on the fly - fixed boxes for all the elements will be quicker for this size DTD

PS. try the perl tool at http://www.sagehill.net/livedtd/download.html

-kg

</thread>

_______________________________________________
Bengalinux-core mailing list
Bengalinux-core@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bengalinux-core
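
A note for anyone picking the 2003 draft up again: as far as I can tell, the last DTD in the thread (2003-05-16) does not follow DTD syntax as posted. Text-only elements are declared with (#PCDATA), not (#CDATA); attribute defaults are written as #REQUIRED, #IMPLIED, #FIXED or a quoted literal (there is no #DEFAULT keyword); and every attribute needs some default declaration, which "date CDATA" lacks. The sketch below is one possible corrected rendering, not a format the list agreed on. These parts are my assumptions and do not come from the thread: "*" in place of "?" so that an info block can hold several synonyms, antonyms, meanings and derivatives; #IMPLIED for origin and date; plural defaulting to "true" as Deepayan suggested (the posted DTD used "false"); and the extra form values, taken from the list of standard forms discussed earlier. The DTD sits in an internal subset only to keep the sample self-contained, and the DOCTYPE name matches the root element.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: one possible valid rendering of the 16 May 2003 draft. -->
<!DOCTYPE dictionary [
  <!ELEMENT dictionary (entry*)>
  <!ELEMENT entry (word, info*)>
  <!ELEMENT word (#PCDATA)>
  <!-- Assumption: '*' instead of '?' so repeated synonyms and meanings validate. -->
  <!ELEMENT info (refer?, pron?, synonym*, antonym*, meaning*, grammar?)>
  <!-- Assumption: origin and date are optional with no default value. -->
  <!ATTLIST info
    pos    (n|adj|v|adv) "n"
    plural (true|false)  "true"
    origin CDATA         #IMPLIED
    date   CDATA         #IMPLIED>
  <!ELEMENT refer (#PCDATA)>
  <!ELEMENT pron (#PCDATA)>
  <!ELEMENT synonym (#PCDATA)>
  <!ATTLIST synonym lang CDATA "bn">
  <!ELEMENT antonym (#PCDATA)>
  <!ATTLIST antonym lang CDATA "bn">
  <!ELEMENT meaning (#PCDATA)>
  <!ATTLIST meaning lang CDATA "bn">
  <!ELEMENT grammar (derivative*)>
  <!ELEMENT derivative (#PCDATA)>
  <!-- Assumption: form values extended from the standard forms listed in the thread. -->
  <!ATTLIST derivative
    form (to|by|for|from|of|in|the|a) "the"
    num  (singular|plural)            "singular">
]>
<dictionary>
  <entry>
    <word>chhaanaa</word>
    <info pos="n" origin="deshi">
      <pron>chhaanaa</pron>
      <synonym lang="en">cottage cheese (?)</synonym>
      <meaning>dudh theke toiri ek dhoroner ...</meaning>
    </info>
    <info pos="n">
      <pron>chhaanaa</pron>
      <synonym lang="en">child</synonym>
      <meaning>shishu, bachchaa</meaning>
      <grammar>
        <derivative form="of" num="singular">chhaanaaTir</derivative>
        <derivative form="of" num="plural">chhaanaader</derivative>
      </grammar>
    </info>
  </entry>
  <entry>
    <!-- Cross-reference entry using the separate refer tag Kaushik preferred. -->
    <word>gabAkSha</word>
    <info>
      <refer>jAnalA</refer>
    </info>
  </entry>
</dictionary>
```

Because the content model of info is a sequence, the children have to appear in the declared order (refer, pron, synonyms, antonyms, meanings, grammar). The layout in the thread that puts an English synonym after the antonym would need either reordering or a looser model such as (refer | pron | synonym | antonym | meaning | grammar)*.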