Saturday, March 27, 2010

28th March, 2010

Update 1 released for the Sanchay version 0.4.0 on Sourceforge.net.

This is a minor update, so you can just download the jar file and replace the old one (in the Sanchay/dist directory) with this one.

There are the following changes/additions in this release:

1. In the Syntactic Annotation interface, when you open a raw text file (simple text without any annotation), the tokenizer that runs will now take into account the old Hindi/Sanskrit sentence/shloka/verse marker (e.g. ।। 1 ।।).

2. You can now convert the encoding of a syntactically annotated file, not just that of a simple text file. This facility has been connected to the Syntactic Annotation interface. (Click on the More button after opening a file).
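
To make the first item above concrete, here is a minimal Python sketch of splitting raw text on numbered verse markers such as ।। 1 ।।. It is only an illustration: the pattern and the function name split_verses are assumptions, and Sanchay's own tokenizer may work quite differently.

```python
import re

# Assumed marker shape: double danda, a verse number (ASCII or Devanagari digits), double danda.
VERSE_MARKER = re.compile(r"।।\s*[0-9०-९]+\s*।।")

def split_verses(text):
    """Split raw text into verses, keeping each marker with the verse it closes."""
    verses = []
    start = 0
    for match in VERSE_MARKER.finditer(text):
        verses.append(text[start:match.end()].strip())
        start = match.end()
    # Any text after the last marker is kept as a final, unmarked unit.
    rest = text[start:].strip()
    if rest:
        verses.append(rest)
    return verses
```

Text without any marker comes back as a single unit, so ordinary sentence splitting would still have to be applied separately.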

Thursday, March 18, 2010

18th March, 2010

I have released the binary for the 0.4.0 version of Sanchay on Sourceforge. I will release the source code later.

The file released is a zip file containing complete Sanchay. For the next few minor releases, I will only release the jar file (and source code, if possible). The jar file in the current zip file can then be replaced with the new jar file to update your copy of Sanchay from 0.4.0 to, say, 0.4.1. For very minor changes, I will put the jar file on Box.net.

The releases should also be available from the previous location for latest builds.

Tuesday, March 16, 2010

16th March, 2010

Fixed a bug related to navigation from the results of a query run on multiple files. As I am not able to access the usual site where I put the latest versions, I am putting the jar file with the modification here:

Sanchay jar file only (16th March, 2010)

You should also be able to see the Box.net widget on the side panel on the right.

I am not able to put the complete zip file as the Box.net site account that I have allows files only up to 25 MB.

Monday, March 15, 2010

15th March, 2010

The Sanchay Home has now moved to http://sanchay.co.in. Also, the Sourceforge UNIX name of Sanchay is now sanchay, instead of nlp-sanchay. Thus, the Sourceforge project page is now http://sourceforge.net/projects/sanchay, instead of http://sourceforge.net/projects/nlp-sanchay. Due to this, you might find that some links no longer go where they went earlier, especially if you are coming from search engines (the Google page shows the old link). I have updated the links in many places (as have the Sourceforge people on their side), but there might still be problems in some places. If you start from the new Sanchay Home, most things should be accessible.

If you are one of the users of Sanchay, it would be helpful if you let me know and (even better) send some feedback, list of bugs found, suggestions for new facilities, suggestions for modifications etc.

Friday, March 12, 2010

News Till Now

Things have been happening on the Sanchay front and hopefully that will continue for a long time.

One of the problems (perhaps the biggest) is that there is no documentation. But I still can't afford to seriously address that problem as I have to finish some urgent work. However, for some time now I have been posting updates about changes and additions to Sanchay here. Since I have got into the good habit of doing this, I will now post the updates using the medium I have become comfortable with.

As a result, I have set up this blog to post any new information about changes and additions to Sanchay. In this first post of the blog, I am going to post all the updates till now from the time I started posting them:

10th March, 2010: Couple of bugs fixed in query processing. Current directory will now be saved for the multiple files mode too.



26th Feb, 2010: Two more additions to the SCQL:




  • Return Values: On the RHS, you can specify return values by using the node symbols and the dot notation (e.g., C, N.A). For this purpose, another symbol S can be used to return sentences for the nodes which match. The syntax is intuitive and easy to remember: If you don't provide an assignment value, then the node address is treated as the return value. One restriction at present is that only node addresses can be return values.

  • Destination: You can save the matched nodes or sentences in a file that you specify.



An example of extracting sentences which have the dependency relation 'pof' for the verb stem कर is given below:



A.a['drel']='pof' AND N.a['lex']='कर' -> S := ssf:qr.txt:UTF-8


Note that there is also a new 'destination' operator :=, which might be followed in the query by a specification of the destination in terms of the format (ssf, bf or bracket format, pos or POS tagged and raw or simple text), the file path and the encoding, which are separated by colon. Optionally (e.g. when working from the GUI) you can leave the destination specification blank:



A.a['drel']='pof' AND N.a['lex']='कर' -> S :=


In this case you will be asked for the path of the file and the charset (encoding). The same will apply for querying multiple files (which right now works only from the interface). In the multiple files case, the path for the destination file will be asked first, followed by the paths of the source files.
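
The destination specification described above (format, file path and encoding separated by colons) could be parsed along the following lines. This is a sketch in Python under stated assumptions: the names Destination and parse_destination are hypothetical, and the four format keywords are taken from the description above rather than from the Sanchay source.

```python
from typing import NamedTuple, Optional

# Format keywords as listed in the update: ssf, bf (bracket format), pos, raw.
FORMATS = {"ssf", "bf", "pos", "raw"}

class Destination(NamedTuple):
    fmt: str
    path: str
    encoding: str

def parse_destination(spec: str) -> Optional[Destination]:
    """Parse 'format:path:encoding'; return None for a blank specification."""
    spec = spec.strip()
    if not spec:
        # Blank destination: the caller would ask the user for path and charset.
        return None
    # Split the format off the front and the encoding off the back,
    # so that a path containing a colon (e.g. a drive letter) survives intact.
    fmt, rest = spec.split(":", 1)
    path, encoding = rest.rsplit(":", 1)
    if fmt.lower() not in FORMATS:
        raise ValueError(f"unknown format: {fmt}")
    return Destination(fmt.lower(), path, encoding)
```

For the example query above, parse_destination("ssf:qr.txt:UTF-8") yields the format ssf, the path qr.txt and the encoding UTF-8.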



You can provide multiple return values, but that will not work at present for the case when you also want to use the destination operator (it will work otherwise):



A.a['drel']='pof' AND N.a['lex']='कर' -> P and C and N


The above will return the current, the previous and the next node.



You will notice a couple of differences in the results table that is displayed. The navigation works in the same way.



14th Feb, 2010: One more wildcard added: . (the last one to match). A bug in the processing of wildcards and ranges fixed. Wildcards are now inclusive, whereas ranges are exclusive. You can use the M symbol on the LHS as well as the RHS. Also, you can use wildcards and ranges on the M nodes too. As an example, you can write the following query to transform the XC XC ... NN kind of sequences (mentioned in the previous updates) to NNC NNC ... NN:



P[*].t/p='XC' and C.t!='XC' -> M[p/*].t=C.t+'C'


A couple of points may be noted. First, the match alias (p) comes after the separator (/) when it is assigned to a condition, and it comes before the separator when it is used to access the matched nodes. Second, the meaning of the wildcard * is any for the condition nodes (A, D, N, P and T) and all for the matched nodes (M). Wildcards are not applicable to nodes which are necessarily single (C, R).



Similarly, to mark predictable karta relations:



C.l='ने' AND C.f='t' AND A.N[?].t/q='VGF' -> A.a['drel']=''k1':M[q/?].a['name']'


I guess the basics of the language are now in place. There are some other things that I plan to do, but for some time I might shift to providing a way to use the language from the command line.



Language-Encoding Identifier: A shell file added to run the language-encoding identifier. To run it on your data, you will have to modify the shell file slightly by providing the path to a directory or to a file (whose language is to be identified) in place of the string 'data/enc-lang-identifier/testing/Hindi-UTF8'. Language identification will be performed on all the files in the directory if you give a directory path. The output is simple: the file path and the guessed language-encoding. You can also change the other arguments if you know how to. If you don't, you can ask me.



13th Feb, 2010: Some major additions:



First, support for some wildcards added. At present, there are two wildcards: ? (first one to match) and * (any/all). I forgot to add another obvious one, but I will do that soon.


Second, support for ranges added. Ranges could be specified with -, e.g. N[2-4], P[-4], A[2-] etc.
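
As an illustration of the range notation, the following Python sketch parses addresses such as N[2-4], P[-4] and A[2-] into a node symbol and optional bounds. The function name parse_range is hypothetical, and treating a missing bound as open-ended is an assumption about the intended meaning.

```python
import re

# A node symbol followed by a bracketed range with optional lower and upper bounds.
ADDRESS = re.compile(r"^([A-Z])\[(\d*)-(\d*)\]$")

def parse_range(address):
    """Return (symbol, low, high); a missing bound is returned as None."""
    match = ADDRESS.match(address)
    if match is None or not (match.group(2) or match.group(3)):
        raise ValueError(f"not a range address: {address!r}")
    symbol, low, high = match.groups()
    return symbol, (int(low) if low else None), (int(high) if high else None)
```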



Third, you can now access the matched nodes (other than the current node C) on the RHS through a new node type M. If there is only one condition, you can access the matched nodes without assigning an alias (the tentative separator for providing an alias is /), just through the symbol M, e.g. M.t='NN'. Otherwise, you can provide an alias for every condition and access the matched nodes through that alias. For example:

C.t~'^V' and N.t/p='VAUX' and A.a['num']/q='pl' -> M[p].t='AuxV' and M[q].t='VPP'
Note that since aliases work like indices, not names, they are written without quotes. Names and values may have to be evaluated (as they might contain variables added through the device of concatenation), but aliases are not.



See a more practical example below.



You can use the matched node symbol M in the same way as other node symbols, i.e., you can use the dot notation to get to the other nodes in the context of the matched nodes on the RHS (e.g. M.N[2].t). In principle, you should even be able to do this on the LHS too, though I haven't tried that yet.



Fourth, you can search and navigate even in the multiple file mode. When you click on some match, the respective file will be opened in a new tab and you will be taken to the corresponding sentence. The matched nodes ('current nodes' for the query) will be highlighted. I will provide a facility to read queries from a file and run them on multiple annotated files, which will be useful for purposes such as performing a quick sanity check on the annotated data.



Some bugs fixed, so a couple of things that were not working earlier should work now (such as the XC XC NN example in the last update).



An example to use wildcards to mark the predictable karta (agent) relations:



C.l='ने' AND C.f='t' AND A.N[?].t/q='VGF' -> A.a['drel']=''k1':M[q].a['name']'


There might be some change in the way wildcards and ranges work. At present both are inclusive. But it will be better to have ranges working in the exclusive mode (either all match or none) and wildcards working in the inclusive mode (even if some don't match, others can).
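
The difference between the two modes can be sketched in Python over a plain list of tags. The helper names are hypothetical and the node positions are 1-based, as in the query notation; this shows the intended semantics only, not how Sanchay implements them.

```python
def range_matches(tags, start, end, wanted):
    """Exclusive mode: the range matches only if every node in it matches."""
    window = tags[start - 1:end]
    return bool(window) and all(tag == wanted for tag in window)

def wildcard_matches(tags, wanted):
    """Inclusive mode: return the 1-based positions that match; the rest are ignored."""
    return [pos for pos, tag in enumerate(tags, start=1) if tag == wanted]
```

With the following nodes tagged XC, XC, NN, a range over positions 1 to 3 fails as a whole because of the NN, whereas the wildcard still picks out positions 1 and 2.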



7th Feb, 2010: Concatenation operator (+) added. One example of its use is for converting sequences like XC NN or XC NNP to NNC NN and NNPC NNP, respectively:



C.t='XC' and P.t!='XC' and N.t!='XC' and C.f='t' -> C.t=N.t+'C'


For a sequence of length three (e.g. XC XC NN), at present you will have to write this:



C.t='XC' and P.t!='XC' and N.t='XC' and C.f='t' -> C.t=N[2].t+'C' and N.t=N[2].t+'C'


Once support for wildcards has been added, this should become simpler.
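
For readers who want to see the intended effect of these transformations independently of the query syntax, here is a rough Python equivalent that works on a flat list of POS tags and handles XC runs of any length. The function name retag_compounds is hypothetical, and the sketch ignores the leaf-node condition (C.f='t') and sentence boundaries.

```python
def retag_compounds(tags):
    """Replace each run of 'XC' tags with <head tag> + 'C'.

    The head is the first non-XC tag to the right of the run. An XC run
    with nothing after it is left unchanged.
    """
    result = list(tags)
    head = None
    # Walk right to left so that the head tag is known before its XC run is reached.
    for i in range(len(result) - 1, -1, -1):
        if result[i] == "XC":
            if head is not None:
                result[i] = head + "C"
        else:
            head = result[i]
    return result
```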



Two commands have also been added that can be used to reallocate the node ids and to reallocate unique names:




  • ReallocateIDs

  • ReallocateNames



The commands are processed in the same way as the queries. The second command above can be used to generate names before you start marking the dependency relations. A lot more commands will be added later.



Another change is that the queries are now case insensitive, except for regular expressions.



The query language is mostly independent of the annotation scheme, i.e., it is not restricted to the scheme based on the Paninian Grammar. The tags and attribute names used in the examples, however, are for this scheme because most of the people that I know who are using Sanchay for annotation are working with this scheme. Both Sanchay and the query language can be easily adapted for other annotation schemes. Nor are they restricted to any specific languages: some tools in Sanchay only work for some languages, but they can be made to work for other languages if the required data, rules etc. are available. Moreover, we will soon be shifting to a purely XML based format.



6th Feb, 2010: An example for marking dependency relations that are very predictable is as follows. Suppose (for a Hindi corpus) you want to mark every NP chunk that contains one of the words (post-positions or vibhaktis) का, के, की and is followed by another NP with a genitive relation (r6) to the next NP. Then you can write this transformative query:



C.l~'^का$|^के$|^की$' AND C.f='t' AND A.N.t='NP' -> A.a['drel']=''r6':A.N.a['name']'


This will work only if unique names have been generated before this query is applied. I will write in the next update about how that can be done with the same query mechanism.



Be careful about the use of quotes to differentiate the literal part of the value ('r6') from the part that has to be evaluated (A.N.a['name']). Also note that there is no space before or after the colon (:).
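
The effect of the genitive query can be sketched in Python on a simplified chunk list. The dictionary layout (tag, name, words, attrs) and the function name mark_genitive are assumptions made for this illustration and do not reflect Sanchay's internal data structures. As with the query, unique chunk names must already exist.

```python
GENITIVE_MARKERS = {"का", "के", "की"}

def mark_genitive(chunks):
    """Give each NP that contains a genitive marker and is followed by an NP
    the attribute drel='r6:<name of the next NP>'."""
    for current, following in zip(chunks, chunks[1:]):
        if (current["tag"] == "NP"
                and following["tag"] == "NP"
                and GENITIVE_MARKERS & set(current["words"])):
            # Literal part 'r6', a colon with no surrounding spaces, then the evaluated name.
            current["attrs"]["drel"] = "r6:" + following["name"]
    return chunks
```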



You can also notice one major change in the notation: square brackets are used now for indices (whether integers or keys) and parentheses are used instead for nested conditions as mentioned in the last update.



5th Feb, 2010: Some more extensions to the SCQL. It is now possible to give queries using nested conditions, combining AND and OR operators. Also, two essential operators added: not equal (!=) and not like (!~). The corrected query (from the last update) for tag validation would be:



C.t!~'^NN$|^JJ$|^VM$|^PSP$' AND C.f='t'


An example of a nested query is:



((C.t~'^N' OR C.t!~'^V') AND C.f='t') OR (C.l~'है' AND C.t~'V')


Queries on multiple files should also work. Note that currently the transformations happen in-place and no backup file is created. That should change soon. Also, at present there is no nesting on the RHS (as the meaning of nesting on the LHS and RHS is different), but I am working on that.



1st Feb, 2010: Facility to query documents in the Syntactic Annotation interface extended. (The query language has been christened, quite unimaginatively, as the Sanchay Corpus Query Language or SCQL). Initial support for transformations added. Two other kinds of nodes added: R (referred node) and T (referring node). Since referring nodes can be more than one, they can be accessed through indices, e.g. T(1) or T(2). The referred node can be accessed by providing the name of the attribute through which it is referred, e.g. R('drel') would give the node that is referred to by the current node with the attribute 'drel' (as in drel='k1:NP1'). You can also check whether a node is a leaf node by writing something like C.f='t'. The two possible values of .f are 't' (true) and 'f' (false). You can query the level of a node in the tree by writing something like C.v='2'. Another important addition is that if you use the symbol ~ instead of =, the values will be treated as regular expressions.



Thus, to replace one tag with another, you can write:



C.t~'^NST$' -> C.t='STNoun'


Multiple transformations can be performed by using the AND operator on the RHS (Right Hand Side), including on nodes other than the current node by using the same notation as for the LHS. But it would be advisable to try first on some sample data as this facility has not yet been tested by others and I have only tried some simple things.



Queries can be written to perform a sanity check after manual or automatic annotation. More details about that later, but as an example, you can check for invalid POS tags by writing a query like:



C.t~'^NN|JJ|VM|PSP$' AND C.f='t' (Not correct: see the next update)


...assuming that there are only four POS tags as listed above. Attribute values and chunk tags can also be validated in a similar way.
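
Apart from needing the negated operator, the pattern above has an anchoring problem, which can be seen with any regular expression engine that searches within a string. The Python sketch below assumes such search-style matching; whether SCQL matches in exactly this way is an assumption. Because alternation binds more loosely than the anchors, ^ applies only to NN and $ only to PSP.

```python
import re

# Anchors attach to the first and last alternatives only.
LOOSE = re.compile(r"^NN|JJ|VM|PSP$")
# Every alternative is anchored at both ends.
STRICT = re.compile(r"^NN$|^JJ$|^VM$|^PSP$")

def invalid_tags(tags, pattern):
    """Return the tags that the pattern does not accept."""
    return [tag for tag in tags if pattern.search(tag) is None]
```

With the loose pattern, a tag such as NNP or XJJ is accepted (it starts with NN, or contains JJ), so it would not be reported as invalid. The strict pattern reports both.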



Yet another addition is that you can first convert the (chunk) tree to dependency tree and perform the query on that by prefixing the query with DS: as below:



DS: C.v='0' AND D(2).t~'.*'


For a document this will return all the sentences (root nodes) where only partial dependency annotation has been performed ('hanging nodes'). Note that this won't return sentences where dependency annotation has not been started at all. For this latter case, you can use this query:



DS: C.v='0'


This will return all nodes on which some or all dependency annotation has been performed. If any sentence is missing in this list, annotation on that sentence has not been started at all. (But you can write a more direct query by using the != operator as mentioned in the next update.)



To check for nodes outside any chunks:



C.v='1' AND C.f='t'


And to check for the presence of any nested chunks:



C.v='2' AND C.f='f'


27th Jan, 2010: Facility to query documents added to the Syntactic Annotation interface. For example, to find all nodes with the lexical data as लिए such that the previous node has the tag PSP you can write this query:



C.l='लिए' AND P.t='PSP'


The results returned will include the matched node, its parent node and the node referred to by it (where applicable) for dependency relations, e.g. if you search for a('drel')='pof', you will also get the node to which this 'pof' relation points.



Similarly to find nodes with the lexical data as 'के' and the parent (usually the chunk) having the tag NP, the query can be:



C.l='के' AND A.t='NP'


The following keys should be helpful till I add more information: C (Current node), P (Previous node), N (Next node), A (Ancestor node), D (Descendent node), l (lexical data or the word), t (tag) and a (attribute).



A query involving an attribute is:



C.l='के' AND A.a('cat')='n'


Note that the attribute names (like 'cat' above) and literal values (like 'के') must be enclosed in quotes. Values not enclosed in quotes must be node addresses using the above notation. Some limitations of this new implementation are: no nested queries, no wildcards or regular expressions and no ranges. Some of these will be removed soon. But you can use something like A(2) (grandparent node) or D(1) (the first child) or N(2) (the node next to the next node). A(1) means the same as A and similar is the case for N and P.
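
The node addressing described above can be pictured with a small tree class in Python. This is only a sketch: the class Node and its method names are hypothetical, and treating N and P as moves among siblings under the same parent is an assumption about how Sanchay resolves them.

```python
class Node:
    def __init__(self, tag, lex="", children=None):
        self.tag = tag
        self.lex = lex
        self.parent = None
        self.children = children or []
        for child in self.children:
            child.parent = self

    def ancestor(self, n=1):
        """A(n): the n-th node up the tree; A(1) is the parent."""
        node = self
        for _ in range(n):
            node = node.parent
            if node is None:
                return None
        return node

    def descendant(self, n=1):
        """D(n): the n-th child (1-based), or None if there is no such child."""
        return self.children[n - 1] if 1 <= n <= len(self.children) else None

    def _sibling(self, offset):
        if self.parent is None:
            return None
        siblings = self.parent.children
        # Locate this node by identity, then step left or right.
        index = next(i for i, s in enumerate(siblings) if s is self) + offset
        return siblings[index] if 0 <= index < len(siblings) else None

    def next(self, n=1):
        """N(n): the n-th following sibling."""
        return self._sibling(n)

    def previous(self, n=1):
        """P(n): the n-th preceding sibling."""
        return self._sibling(-n)
```

In these terms, the query C.l='के' AND A.t='NP' corresponds to checking node.lex == 'के' and node.ancestor().tag == 'NP'.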



25th Jan, 2010: The statistics facility in the Syntactic Annotation interface extended. The following statistics about the document being annotated are available now:




  • Number of paragraphs (if applicable)

  • Number of sentences

  • Number of words

  • Number of characters

  • Number of chunks

  • Number of POS tags

  • Number of chunk tags

  • Number of attributes

  • Number of attribute values

  • Number of attribute-value pairs

  • Number of untagged words

  • Words and their frequencies

  • POS tags and their frequencies

  • Chunk tags and their frequencies

  • Word-tag pairs and their frequencies

  • Unchunked words and their frequencies

  • Lists of words for each tag

  • Lists of tags for each word
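
Several of the statistics listed above are simple counts over (word, tag) pairs. The Python sketch below shows how a few of them could be computed. The input layout (a list of sentences, each a list of word and tag pairs, with an empty tag for untagged words) and the function name corpus_stats are assumptions for this illustration. Chunk and attribute statistics are left out.

```python
from collections import Counter, defaultdict

def corpus_stats(sentences):
    """Compute basic word and POS tag statistics for a tagged document."""
    word_freq = Counter()
    tag_freq = Counter()
    pair_freq = Counter()
    words_by_tag = defaultdict(set)
    tags_by_word = defaultdict(set)
    untagged = 0
    for sentence in sentences:
        for word, tag in sentence:
            word_freq[word] += 1
            if not tag:
                untagged += 1
                continue
            tag_freq[tag] += 1
            pair_freq[(word, tag)] += 1
            words_by_tag[tag].add(word)
            tags_by_word[word].add(tag)
    return {
        "sentences": len(sentences),
        "words": sum(word_freq.values()),      # running words (tokens)
        "characters": sum(len(w) * n for w, n in word_freq.items()),
        "distinct_tags": len(tag_freq),
        "untagged": untagged,
        "word_freq": word_freq,
        "tag_freq": tag_freq,
        "pair_freq": pair_freq,
        "words_by_tag": words_by_tag,
        "tags_by_word": tags_by_word,
    }
```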



18th Jan, 2010: In the Syntactic Annotation interface, you can now navigate to a sentence by clicking on the table showing the search results.



17th Jan, 2010: A shell script to extract sentences based on surface similarity (for Hindi) added. Another script to run a Statistical Machine Translation (SMT) based transliteration tool added (for English-Hindi and Hindi-English). If you run the scripts without any arguments, you will see a usage description. Some bugs fixed.



7th Jan, 2010: Many internal changes which are not yet reflected in the GUI. Apart from them, the Sentence Aligner interface connected with an automatic sentence aligner that should work better for the English-Hindi pair than for other pairs. The main window now has the list of available input methods in a combo box, which should make it easier to switch between input methods (for typing in different languages). I will add the facility to set shortcuts for selecting input methods soon. A couple of other bugs fixed. In the menu you will see commands for connecting to a remote computer, but they are not working yet.



11th Dec, 2009: A new tool added for creating linguistic trees based on the X-bar theory (but it can be used for any kind of phrase structure trees and maybe even other kinds of trees). There are shortcuts to add binary and ternary subtrees, triangles, adjuncts etc. Features can be added to nodes by right clicking and selecting. The tags, features and their values can be easily customized. The trees created can be stored in the SSF format as a text file and can also be exported as jpg or eps images. Terminal nodes can be edited to add text by clicking on them and typing. Most of the tree creation can be done just by using the keyboard (if you feel comfortable with that, otherwise you can use the mouse). To edit the text in a terminal node, move to that node using the keyboard (or the mouse) and press spacebar (or double click) and type. Pressing Enter will complete node text editing. If the text is not displayed completely, just click on the + (Zoom In) or - (Zoom Out) button once.



9th Dec, 2009: The first working version of the Sentence Alignment interface completed. The input and output formats are the same as for the Word Alignment interface. The alignment positions are synchronized and no crossing edges are allowed. Many-to-one alignment is possible as long as there is no crossing.
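
The no-crossing constraint can be stated precisely: two alignment edges cross if one starts earlier on the source side but ends later on the target side. A minimal Python check is sketched below; the function name has_crossing and the representation of alignments as (source index, target index) pairs are assumptions for this illustration.

```python
def has_crossing(alignments):
    """Return True if any two (source, target) edges cross each other."""
    # Sort by source index, then target index. Without crossings, the target
    # indices must then be non-decreasing from one edge to the next.
    pairs = sorted(set(alignments))
    for (_, target_a), (_, target_b) in zip(pairs, pairs[1:]):
        if target_b < target_a:
            return True
    return False
```

Many-to-one alignments such as (0, 0) and (1, 0) pass this check, which matches the behaviour described above.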



7th Dec, 2009: Facility to edit and save shortcuts added (currently available for only the Frameset editing mode). A bug that was introduced after the last changes has been fixed. (The bug was related to displaying the dependency trees in the Syntactic Annotation interface).



4th Dec, 2009: The facilities to delete individual alignments and all alignments in a sentence pair added. Also added the facility to reset the alignments of a sentence pair to the values previously saved. To delete an individual alignment, just press the Ctrl key and redraw (drag-and-drop) the alignment. Another addition is that you can load data that is in GIZA++ format.



3rd Dec, 2009: The first working version of the Word (and Phrase) Alignment interface completed. Unlike the Parallel Corpus Markup interface, this one allows alignment by drag-and-drop. It is possible to group together words in the source and the target language sentences, tag them and align them. The layout is horizontal, unlike the Syntactic Annotation interface. Alignment can be many-to-many. The input file can be simple text (one sentence per line) or a text file in SSF format. The saved files are currently in SSF format. The alignment information is stored in the SSF files through an additional attribute 'alignedTo'. You can also try the initial version of the Sentence Alignment interface, but it won't yet save the alignments.



24th Nov, 2009: A problem introduced due to the recent (15th Nov) changes corrected. The Sentence and Word Alignment interfaces are being redesigned to allow easier and faster manual alignment. They are not yet complete but should be soon.



16th Nov, 2009: One bug fixed in syntactic annotation replace facility. The data for the Language and Encoding Identifier recompiled so as to work with the current version.



15th Nov, 2009: There are a few more components in this version, one of them being a fully functional charmap cum font viewer and another being a (manual) sentence alignment interface. Some bugs in the Frameset mode of syntactic annotation have also been fixed. Also, this version brings a new mode of operating Sanchay, i.e., through the Sanchay Shell. A simple toolbar has been added to quickly start applications. In the Word List Visualizer tool, you can now use Surface Similarity to search for similar words (for now only for Indian languages, but to be extended for other languages). Future versions are likely to focus more on the shell mode of operation, rather than GUI based applications, although those will also continue to be added. But since a lot of new things have been added, there may be some things which don't work properly (as usual).



6th Oct, 2009: There is one more component integrated with the Sanchay GUI now: a verb frameset editor compatible with Cornerstone. Also, the Syntactic Annotation interface can now be run in frameset annotation mode (look for a checkbox below the tree area). The frameset editor is connected with the annotation interface (somewhat like Jubilee) such that new frameset files can be created while annotating if there is no existing file. The find and replace facility on the annotation interface has been improved too, though there might still be some issues. I will try to add keyboard shortcuts for more things soon so that annotation can be easier and faster. And, of course, documentation...