Saturday, September 1, 2012

1st September, 2012

Release of version 0.6.2.

Added support for calling the Java Sanchay corpus API (SSF API) from Python programs. An example program is provided in the examples directories. Before running the Python program, a script has to be run to start the Java-Python bridge, called the

All the usual operations should be possible using this (from Python). Querying an annotated corpus using the Sanchay Corpus Query Language is also possible.

Thursday, August 23, 2012

23rd August, 2012

Releasing version 0.6.1. It includes a bug fix for the Word Alignment Interface. Earlier, when data aligned with the Sentence Alignment Interface was loaded, the alignments were not shown properly.

There might still be some minor issues with this. For example, in the case of one-to-many or many-to-one alignments, if the sentences on the *many* side have words with identical names (unique identifiers), there will be a problem. One option for handling this is to augment alignment labels with the sentence ids to make them unique across the corpus file.
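The idea of augmenting labels with sentence ids can be sketched as follows. This is only an illustration in Python, not code from Sanchay: the function name `qualify_label`, the `:` separator and the sample ids are all assumptions made for the example.

```python
def qualify_label(sentence_id, word_name):
    """Prefix a word's name with its sentence id (hypothetical "id:name" form)."""
    return f"{sentence_id}:{word_name}"


# Two sentences on the *many* side that both contain a word named "NP1".
# The sentence ids and word names here are made up for illustration.
many_side = {"s2": ["NP1", "VG1"], "s3": ["NP1", "NP2"]}

# After qualification, the two "NP1" words no longer clash.
qualified = [
    qualify_label(sentence_id, name)
    for sentence_id, names in many_side.items()
    for name in names
]
```

With labels of this kind, an alignment pointing at `NP1` in one sentence can no longer be confused with `NP1` in another sentence of the same file.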

In the case of one-to-many or many-to-one alignments, the Next and the Previous buttons can be used to navigate to those alignments. The combo box will still show source sentence numbers and on going to a particular source sentence, you will reach the first alignment that it has. The following alignments can be reached with the Next button.

Monday, August 20, 2012

20th August, 2012

Version 0.6.0 of Sanchay released on the website (latest updates section).

This release comes so long after the previous one that I don't remember what changes have happened, but there aren't many.

However, from now on the releases might be more systematic. A source code release, which has not happened for quite some time now, is also expected soon.

There is a bug with the Word Alignment Interface that is being fixed and the next release should be after that. Probably within a few days. With source code.

Saturday, May 21, 2011

21st May, 2011

An incomplete list of some goings-on:

  • The Sanchay website has been moved to a different server. There were two main reasons for this. The first was that certain things (like online Java applications) can't be hosted on (where the site was earlier hosted) and the online apps were already hosted on another server. The other reason was that doesn't allow outgoing emails, which means that if I create some user's account on the site (where the user is not a user), the confirmation mail won't be sent to that user.
  • In the Syntactic Annotation Interface, it is now possible to build a dependency tree directly on lexical items (words), rather than on chunks. You can, of course, still use the chunk mode, which is the one being used for the major treebank projects for Indian languages.
  • There have been some extensions to the corpus query language (Sandhaan). The website for Sandhaan is also being moved, though there is not much content there.
  • There is now a facility that can give you accumulated statistics for syntactically annotated data. You can query it for specific words, tags, relations etc. For example, if you want to check what tags have been given so far for a selected word, you can do that. You can do the opposite too, i.e., you can check what words have been assigned a given tag (say, the tag given to the current word in the Syntactic Annotation Interface). The same goes for chunks and chunk tags as well as chunks and chunk relations etc.
  • There is a version of the validation tool that now uses Sandhaan queries, instead of programs or scripts in Perl or some other language.
  • All these changes were made some time ago and many things have happened since then, so I am having trouble remembering what other changes there were ...

Wednesday, February 23, 2011

23rd February, 2011

A new avatar (no connection to the movie) of Sanchay is going to appear soon.

The first Sanchay Online Application is on the way. Like most of the major parts of Sanchay, it was done in three or four days.

So that's why there are so many bugs?

Watch this space (as they say)...

(Also note that the latest builds now will be available here, instead of here.)

Wednesday, January 19, 2011

19th January, 2011

(Version: 0.4.2)

Syntactic Annotation:

  • You can now use different tagsets for different languages. As an example, the list of POS tags for English will have to be stored in the file pos-tags-eng.txt (in the directory Sanchay/workspace/syn-annotation). The default list (used when a language-specific list is not available) will be in the file pos-tags.txt. The same applies to chunk tags (phrase-names-eng.txt). When you select the language in the annotation interface, the tagset for that language (if available) will be used. Otherwise, the default tagset will be used.

  • It is also possible now to use hierarchical tagsets. Such tagsets are stored in the same location as the normal tagsets (as above). The levels in tags are specified with double underscore. So, for example, if you want three levels of verb tags (from the most general or coarse to the most specific or fine), you will have to list the tag as something like V1__V2__V3.

  • Hierarchical tagsets are used thus: For a given category (say, verb), you can select the level that you are going to use for annotation. For other categories, you can use other levels. The idea is that the annotation can be done at different levels of generality or coarseness for different categories, depending on the purpose for which the annotated corpus is going to be used. To use the second level for verbs while annotating a particular corpus, select V2 as the desired level before you start annotating. This selection can be done from the syntactic annotation interface by right-clicking and selecting Hierarchical Tags. The selection is stored in a mapping file (e.g. pos-tags-levels.txt) in the workspace/syn-annotation directory.

  • Duplicate attributes with the same name will not be added now (a bug fix).

  • A few other bugs fixed.
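The tagset lookup and the hierarchical tag levels described above can be sketched in a few lines of Python. This is only an illustration of the conventions (file naming and the double underscore separator), not Sanchay's actual code: the function names are hypothetical, and falling back to the finest available level for a shallow tag is an assumption made for the example.

```python
from pathlib import Path


def tagset_file(workspace, lang, base="pos-tags"):
    """Return the language-specific tagset file if it exists, else the default.

    For lang="eng" this looks for pos-tags-eng.txt and falls back to
    pos-tags.txt, following the naming convention described above.
    """
    specific = Path(workspace) / f"{base}-{lang}.txt"
    return specific if specific.exists() else Path(workspace) / f"{base}.txt"


def tag_at_level(hierarchical_tag, level):
    """Return the tag at a 1-based level of a tag such as V1__V2__V3."""
    levels = hierarchical_tag.split("__")
    # Assumption: a tag with fewer levels than requested yields its finest level.
    return levels[min(level, len(levels)) - 1]
```

For instance, `tag_at_level("V1__V2__V3", 2)` picks `V2`, the level one would select for verbs in the example above, and `tagset_file(workspace, "eng", base="phrase-names")` would resolve the chunk tag list in the same way.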

XML for Syntactically Annotated Corpus:

  • This is the major change in this version. Earlier, syntactically annotated corpus (and also some other kinds of corpus) was stored in the SSF format (as the default). Now you can open and save files in XML format too.

  • The XML tags used for the XML format are listed in the file xml-props.txt in the Sanchay/props directory.

  • XML format will work wherever SSF files were used earlier.

Bugs in Automatic Name Allocation:

  • In the annotation interfaces where SSF files are used, unique names are automatically generated to identify the nodes. This name generation had some bugs, such as not handling characters like apostrophes and quotes.

  • I have tried to fix most of these bugs. If something remains, let me know.

Word Alignment and Sentence Alignment Interfaces:

  • It is now possible to save in the GIZA++ format (reading this format was available earlier).

  • The problems related to wrong handling of apostrophe etc. should not happen now.

  • Duplicate alignments (due to duplicate attributes) should also not appear.

Sanchay Corpus Query Language:

  • Some bugs fixed in handling queries involving the referred and referring nodes (the R and T nodes).

  • Support for the logical operator NOT added. Thus, you should now be able to write something like: C.t='NN' and !(C.A.t='NP')

Friday, April 9, 2010

10th April, 2010

Some more information about the task mode of operation for the Annotation Interfaces (not yet implemented for the Parallel Corpus Markup):

  • The annotation process starts with one copy of the document to be annotated, say, story-1.

  • When an annotator (say, john) claims this document (opens and saves it in the task mode), another copy is created, with the file name story-1-john. This is the file on which the annotator will be working.

  • However, the task name (e.g. story-1) will apply to all the copies.

  • When another user (e.g. terry) has claimed the same task, another copy will be created (story-1-terry).

  • Now the adjudicators can use the annotation comparison facility to compare the two annotations.

  • The adjudicators can select one of the two annotations and make changes to it directly from the comparison facility.

  • When an adjudicator (say, chapman) saves the selected and modified version of one of the annotations, it gets saved with the original document name (story-1), i.e., the original document is overwritten.

  • An option worth considering is whether, instead of overwriting the original document, another copy should be created with the name of the adjudicator (story-1-chapman). But what if the adjudicator is also one of the annotators? That shouldn't be a problem, because it can't be true for the same document.

  • However, the copies of the work by the two annotators remain available for more work by the annotators or for any other use later, such as calculating inter-annotator agreement.
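The file naming in the task mode can be sketched as below. This is a minimal Python illustration of the convention described in the list above, not the actual implementation: the function names are hypothetical, and the example assumes that the annotator names are known in advance.

```python
def annotator_copy(task, annotator):
    """Name of the working copy created when an annotator claims a task."""
    return f"{task}-{annotator}"


def task_of(file_name, annotators):
    """Recover the task name from a file name, given the known annotators."""
    for annotator in annotators:
        suffix = f"-{annotator}"
        if file_name.endswith(suffix):
            return file_name[: -len(suffix)]
    # No annotator suffix: this is the original (or adjudicated) document.
    return file_name


annotators = ["john", "terry"]
copies = [annotator_copy("story-1", name) for name in annotators]
```

Here `copies` holds `story-1-john` and `story-1-terry`, and `task_of` maps both of them, as well as `story-1` itself, back to the same task name.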

That brings me to the facility I will try to add soon: calculating inter-annotator agreement for different levels of annotation.

Also, note that the task mode of working is not yet available for sentence alignment and word alignment interfaces. That is another item on the agenda.