Wednesday, January 19, 2011

19th January, 2011

(Version: 0.4.2)

Syntactic Annotation:

  • You can now use different tagsets for different languages. For example, the list of POS tags for English must be stored in the file pos-tags-eng.txt (in the directory Sanchay/workspace/syn-annotation). The default list (used when a language-specific list is not available for a particular language) is in the file pos-tags.txt. The same applies to chunk tags (e.g. phrase-names-eng.txt). When you select the language in the annotation interface, the tagset for that language is used if it is available; otherwise, the default tagset is used.

  • It is now also possible to use hierarchical tagsets. Such tagsets are stored in the same location as the normal tagsets (as above). The levels in a tag are separated by a double underscore. So, for example, if you want three levels of verb tags (from the most general or coarse to the most specific or fine), you list the tag as something like V1__V2__V3.

  • Hierarchical tagsets are used as follows: for a given category (say, verb), you select the level that you are going to use for annotation. For other categories, you can select other levels. The idea is that annotation can be done at different levels of generality or coarseness for different categories, depending on the purpose for which the annotated corpus is going to be used. To use the second level for verbs while annotating a particular corpus, select V2 as the desired level before you start annotating. This selection can be made from the syntactic annotation interface by right-clicking and selecting Hierarchical Tags. The selection is stored in a mapping file (e.g. pos-tags-levels.txt) in the workspace/syn-annotation directory.

  • Duplicate attributes with the same name will not be added now (a bug fix).

  • A few other bugs fixed.
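The tagset lookup and the level selection described above can be sketched in a few lines of Python. This is only an illustration, not Sanchay's own code: the function names resolve_tagset_file and tag_at_level are hypothetical, and falling back to the finest available level when a tag has fewer levels than requested is an assumption.

```python
from pathlib import Path


def resolve_tagset_file(directory, lang_code, base="pos-tags"):
    """Return the language-specific tagset file if it exists, else the default.

    Mirrors the naming described above: pos-tags-eng.txt for English,
    pos-tags.txt as the default list.
    """
    specific = Path(directory) / f"{base}-{lang_code}.txt"
    default = Path(directory) / f"{base}.txt"
    return specific if specific.is_file() else default


def tag_at_level(hierarchical_tag, level):
    """Pick one level (1 = coarsest) from a tag such as 'V1__V2__V3'."""
    if level < 1:
        raise ValueError("level must be 1 or greater")
    levels = hierarchical_tag.split("__")
    # Assumption: if the tag has fewer levels than requested,
    # use the finest level that is available.
    return levels[min(level, len(levels)) - 1]
```

With these definitions, tag_at_level("V1__V2__V3", 2) picks V2, which corresponds to selecting the second level for verbs.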

XML for Syntactically Annotated Corpus:

  • This is the major change in this version. Earlier, a syntactically annotated corpus (and also some other kinds of corpus) was stored in the SSF format (as the default). Now you can open and save files in XML format too.

  • The XML tags used for the XML format are listed in the file xml-props.txt in the Sanchay/props directory.

  • XML format will work wherever SSF files were used earlier.

Bugs in Automatic Name Allocation:

  • In the annotation interfaces where SSF files were used, unique names are automatically generated to identify the nodes. This name generation had some bugs, such as not handling characters like apostrophes and quotes.

  • I have tried to fix most of these bugs. If something remains, let me know.

Word Alignment and Sentence Interfaces:

  • It is now possible to save in the GIZA++ format (reading this format was available earlier).

  • The problems caused by wrong handling of apostrophes etc. should no longer occur.

  • Duplicate alignments (due to duplicate attributes) should also not appear.
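As a rough illustration of what saving alignments in a GIZA++ style involves, the sketch below formats the alignment line used in GIZA++ A3 output files, where each word is followed by the positions it is aligned to, in the form word ({ 1 2 }). Whether Sanchay writes exactly this variant is not stated here, so treat this as an assumption; the function name format_a3_line and the dictionary-based input are hypothetical.

```python
def format_a3_line(words, links):
    """Format one A3-style alignment line.

    words: the words of one sentence, in order.
    links: dict mapping a word position (0 = NULL, 1 = first word, ...)
           to the positions it is aligned to in the other sentence.
    """
    tokens = ["NULL"] + list(words)
    parts = []
    for position, token in enumerate(tokens):
        # sorted(set(...)) drops duplicate links to the same position.
        aligned = sorted(set(links.get(position, [])))
        inner = " ".join(str(p) for p in aligned)
        if inner:
            parts.append(f"{token} ({{ {inner} }})")
        else:
            parts.append(f"{token} ({{ }})")
    return " ".join(parts)
```

In a complete A3 file, each such line follows a comment line describing the sentence pair and a line containing the other sentence; those two lines are omitted here.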

Sanchay Corpus Query Language:

  • Some bugs fixed in handling queries involving the referred and referring nodes (the R and T nodes).

  • Support for the logical operator NOT added. Thus, you should now be able to write something like: C.t='NN' and !(C.A.t='NP')
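To make the logic of that example query concrete, here is a small Python sketch of the same condition. It assumes that C refers to the current node, A to its parent, and t to the tag; this reading is an assumption made for illustration, and the Node class and matches function are hypothetical, not part of Sanchay.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    tag: str
    parent: Optional["Node"] = None


def matches(node):
    """Evaluate the equivalent of: C.t='NN' and !(C.A.t='NP')."""
    # Assumption: a node without a parent has no 'NP' parent,
    # so the negated condition holds for it.
    parent_is_np = node.parent is not None and node.parent.tag == "NP"
    return node.tag == "NN" and not parent_is_np
```

Under this reading, an NN node inside a VP matches, while an NN node directly inside an NP does not.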