### README.txt ###
### 14 Sep 2009 ###
### v0.2-beta  ###
###################

### DOWNLOAD ###
#Download the source file into a folder on your Unix, Linux, Mac, or Cygwin system:
wget http://www.idi.ntnu.no/~satre/akane/AkaneRE-beta.tgz


### INSTALL ###
#This creates a folder AkaneRE in the current directory,
# and puts everything there
tar -xvzf AkaneRE-beta.tgz
cd AkaneRE
make
less README.txt
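
#A minimal sketch, not part of the distribution: the function name
#unpack_and_build is hypothetical. It checks the archive with "tar -tzf"
#(list without extracting) before unpacking, and assumes, as described in
#this README, that the archive unpacks into ./AkaneRE and that "make"
#creates the akane executable there.

```shell
# unpack_and_build ARCHIVE (hypothetical helper)
# Verifies the archive, unpacks it, runs make, and checks for the executable.
unpack_and_build() {
    archive=$1
    # tar -tzf only lists the contents; it fails on a missing or damaged file.
    if ! tar -tzf "$archive" >/dev/null 2>&1; then
        echo "error: $archive is missing or not a valid .tgz" >&2
        return 1
    fi
    tar -xzf "$archive" || return 1
    # Build inside a subshell so the caller's working directory is unchanged.
    (cd AkaneRE && make) || return 1
    if [ ! -x AkaneRE/akane ]; then
        echo "error: make did not produce AkaneRE/akane" >&2
        return 1
    fi
}
```

#Example:
# unpack_and_build AkaneRE-beta.tgz && cd AkaneRE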


### USAGE ###
#Run in the background and redirect stderr to AIMed.output.txt, so you can
#watch the results and other debugging information in that file:

./akane script REMerge/smbm2akane/AImed.enju-ksdep.config.xml 2>AIMed.output.txt &

#On Sun-Grid:
# qrsh -nostdin ./akane script ......config.xml 2\>AIMed.output.txt &
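
#A minimal sketch of the background run above, wrapped in a small helper.
#The names start_akane and akane_pid are hypothetical and not part of
#AkaneRE. The helper refuses to start if the executable or the config file
#is missing, and remembers the process id so you can wait for the run.

```shell
# start_akane CONFIG LOG (hypothetical helper)
# Starts akane in the background with stderr redirected to LOG and
# stores the process id in the variable akane_pid.
start_akane() {
    config=$1
    log=$2
    if [ ! -x ./akane ]; then
        echo "error: ./akane not found; run make first" >&2
        return 1
    fi
    if [ ! -r "$config" ]; then
        echo "error: cannot read $config" >&2
        return 1
    fi
    ./akane script "$config" 2>"$log" &
    akane_pid=$!
}
```

#Example:
# start_akane REMerge/smbm2akane/AImed.enju-ksdep.config.xml AIMed.output.txt
# tail -f AIMed.output.txt     (follow the log, stop following with Ctrl-C)
# wait "$akane_pid"            (block until akane has finished)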


### FILES ###
AkaneRE/     Main Folder, contains everything!
..Makefile     type "make" to create akane executable
..README.txt    you are reading the information in it right now
..HOWTO.txt     TODO: things someone should take care of
..main.cpp     starts everything, look here if you wonder which code to look at
..config.cpp    reads the config.xml-file and sets all parameters for akane
..evaluation.cpp  Does all the 10-fold calculations etc
..features.cpp   extracting graph, path and Bag-Of-Word features from the parsers
..files.cpp     handles the use of multiple inputfiles organized in folder structures
..namedEntities.cpp methods for dealing with custom Named Entity types
..parser.cpp    coordinates the use of multiple parsers for better scores
..path.cpp     supporting methods for handling parser graphs and paths
..reader.cpp    reads all the SO and Parser inputfiles in different formats
..relations.cpp   methods for dealing with custom Relation types
..results.cpp    methods for dealing with custom Result and output types
..server.cpp    support for never-ending reading from STDIO and predicting to STDOUT
..standoff.cpp   general template for handling StandOff-annotated constituents, words..
..svm.cpp      runs the machine learning and prediction using svm-light-GK
..syntax.cpp    is a goldmine if you ever want to process Enju SO output
          It changes SO into useful objects corresponding to Sentences,
          Words, Part-Of-Speech Tokens, constituents, etc
..timer.cpp     methods for timing different parts of the program execution
..writer.cpp    prints all results and other informative files
..*.h        corresponding header files with Public Methods and Constants

..REMerge/     REMerge folder, contains 5 corpora, pre-parsed by enju and gdep
...*.config.xml   Configuration files for easily running the akane PPI predictions;
          they can be changed to do exactly what you want (without recompiling)
...*.xml    One corpus in XML format
...*.so     Stand-Off tags for the corpus, matching the .txt file
...*.txt    Text file matching the .so file, Whole corpus in single file SO-format
...*.enju.so  Parse-output from enju
...*.kenji.so  Parse-output from KSDep
...*.FNpairs.txt Sample output showing how error analysis can be done during
...*.FPpairs.txt 10-fold cross-validation.
...*.TPpairs.txt The output of these files is controlled from the config.xml file.
...*.output.txt  output from an actual 10-fold crossvalidation with results
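
#A minimal sketch of one way to use the *.TPpairs.txt, *.FPpairs.txt and
#*.FNpairs.txt files for a quick sanity check. ASSUMPTION: each file holds
#exactly one pair per line; this README does not document the file format,
#so check your own output first. The function name pair_scores is
#hypothetical. Precision is TP/(TP+FP), recall is TP/(TP+FN), and F1 is
#their harmonic mean.

```shell
# pair_scores PREFIX (hypothetical helper)
# Counts lines in PREFIX.TPpairs.txt, PREFIX.FPpairs.txt and PREFIX.FNpairs.txt
# (assumed: one pair per line) and prints precision, recall and F1.
pair_scores() {
    prefix=$1
    for kind in TP FP FN; do
        if [ ! -r "$prefix.${kind}pairs.txt" ]; then
            echo "error: cannot read $prefix.${kind}pairs.txt" >&2
            return 1
        fi
    done
    tp=$(wc -l < "$prefix.TPpairs.txt")
    fp=$(wc -l < "$prefix.FPpairs.txt")
    fn=$(wc -l < "$prefix.FNpairs.txt")
    # Guard every division so empty files give 0 instead of an awk error.
    awk -v tp="$tp" -v fp="$fp" -v fn="$fn" 'BEGIN {
        p = (tp + fp) ? tp / (tp + fp) : 0
        r = (tp + fn) ? tp / (tp + fn) : 0
        f = (p + r) ? 2 * p * r / (p + r) : 0
        printf "TP=%d FP=%d FN=%d P=%.3f R=%.3f F1=%.3f\n", tp, fp, fn, p, r, f
    }'
}
```

#Example (the prefix is a placeholder for one of your own output files):
# pair_scores REMerge/YOUR_CORPUS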

svm-light-GK.zip      SVMlight with GraphKernels, modified from Alessandro Moschitti
              http://dit.unitn.it/~moschitt/Tree-Kernel.htm

### THIRD PARTY SOFTWARE ###
tinyxml_2_5_3.tar.gz    TinyXML from Sourceforge
              http://downloads.sourceforge.net/tinyxml/tinyxml_2_5_3.tar.gz
enju-2.2.tar.gz      Enju from Tsujii Laboratory at University of Tokyo
liblilfes-1.3.4.tar.gz   LilFes is integrated in Enju (from Tsujii Laboratory)
StandOffManager.tar.gz   a nice program to convert between XML and SO formats
              http://www.idi.ntnu.no/~satre/akane/StandOffManager-0.2.tgz

### RANDOM INFORMATION ###

### SVN ###
svn co file:///home/svn/satre/AkaneRE/trunk AkaneRE


### CLEANUP ###

#If you want to get rid of the source files for Enju and LilFes you can
make clean

#If you want to get rid of the tar, tgz and zipped files too, use
make clean_source


### PARSING ###

#Enju
(I'm using the free parser "enju", which takes 30min: http://www-tsujii.is.s.u-tokyo.ac.jp/enju/ )
cat AIMed/AIMed.nerTokenized.txt | enju -genia -so > AIMed/AIMed.nerTokenized.enju.so &

(Another good, fast option is the "mogura" version of "enju", which takes 2min)
cat AIMed/AIMed.nerTokenized.txt | mogura -genia -so > AIMed/AIMed.nerTokenized.enju.so &


#KSDep / GDep
( or the Genia Dependency parser "GDep": http://www.cs.cmu.edu/~sagae/parser/gdep/ )
#First parse to dependency format, then convert the result to SO-format:
cat AIMed/AIMed.nerTokenized.txt | gdep > AIMed/AIMed.nerTokenized.dep
./dep2so.prl AIMed/AIMed.nerTokenized.dep AIMed/AIMed.nerTokenized.txt > AIMed/AIMed.nerTokenized.gdep.so

#First create conll-X format input, and parse:
cat test.txt | geniatagger | ./ksdep/pos2conll.prl | ./ksdep/ksdep -m genia.mod > OUTPUTFILE.kenji.dep.txt
#Then make SO-format:
./dep2so.prl OUTPUTFILE.kenji.dep.txt test.txt > test.kenji.so.txt
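
#A minimal sketch for parsers that read text on stdin and write SO output
#on stdout, as enju and mogura do in the commands above. The function name
#parse_corpus is hypothetical. It follows the naming used in this section
#(INPUT.txt becomes INPUT.SUFFIX.so), skips inputs that already have a
#non-empty output file, and writes to a temporary file first so that a
#failed parse does not leave a truncated .so file behind.

```shell
# parse_corpus INPUT SUFFIX PARSER [ARGS...] (hypothetical helper)
# Runs PARSER with INPUT on stdin and writes INPUT minus ".txt" plus ".SUFFIX.so".
parse_corpus() {
    input=$1
    suffix=$2
    shift 2
    output="${input%.txt}.${suffix}.so"
    if [ -s "$output" ]; then
        echo "skip: $output already exists" >&2
        return 0
    fi
    # Write to a temporary name and rename only when the parser succeeded.
    if "$@" < "$input" > "$output.tmp"; then
        mv "$output.tmp" "$output"
    else
        rm -f "$output.tmp"
        echo "error: parser failed on $input" >&2
        return 1
    fi
}
```

#Example:
# parse_corpus AIMed/AIMed.nerTokenized.txt enju enju -genia -so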


### RUNTIME STATISTICS ###
#AIMed
Processing the AIMed corpus with 10-fold cross validation takes
- 100 seconds in linear-svm mode
- 20 minutes in tree-kernel svm mode

#BioNLP
When experimenting with the BioNLP shared task: 8000 sentences, 150 templates in 9 groups
(the Label-4 group contained 68 ambiguous templates, with 2.5GB of training features created).
Training took 84 minutes; memory consumption was 40 gigabytes.
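
#A minimal sketch for collecting wall-clock timings like the ones above on
#your own machine. The names timed_run and elapsed_seconds are hypothetical.
#It relies on "date +%s" (seconds since the epoch), which GNU and BSD date
#support but which is not guaranteed on every system. It measures elapsed
#time only, not memory consumption.

```shell
# timed_run COMMAND [ARGS...] (hypothetical helper)
# Runs the command, stores the elapsed wall-clock seconds in elapsed_seconds,
# and returns the command's own exit status.
timed_run() {
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    elapsed_seconds=$((end - start))
    return "$status"
}
```

#Example:
# timed_run ./akane script REMerge/smbm2akane/AImed.enju-ksdep.config.xml 2>AIMed.output.txt
# echo "elapsed: ${elapsed_seconds}s"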



### DEBUGGING ###
#For the GNU Debugger (gdb):
gdb akane
set args script REMerge/smbm2akane/AImed.enju-ksdep.config.xml
run
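
#A minimal sketch of the same session run non-interactively, for example to
#capture a backtrace after a crash. The function name akane_backtrace is
#hypothetical. It uses the standard gdb options -batch, -ex and --args.
#Whether the Makefile builds akane with debug symbols is not stated here;
#without them the backtrace shows little detail. If akane exits normally,
#gdb has no stack to print.

```shell
# akane_backtrace CONFIG (hypothetical helper)
# Runs akane under gdb without an interactive prompt and prints a backtrace.
akane_backtrace() {
    config=$1
    # -batch exits once the given commands are done, each -ex runs one gdb
    # command, and --args passes everything after the program name to akane.
    gdb -batch -ex run -ex backtrace --args ./akane script "$config"
}
```

#Example:
# akane_backtrace REMerge/smbm2akane/AImed.enju-ksdep.config.xml 2>&1 | tee gdb.log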

#Usage: "-C g" means a combination of kernels, without the graph kernel.
#Default run:
./akane script REMerge/smbm2akane/AImed.enju-ksdep.config.xml 2>AIMed.output.txt &

#Run with "-C g":
./akane script REMerge/smbm2akane/AImed.enju-ksdep.config.xml -C g 2>AIMed.output.txt &