### README.txt ###
### 14 Sep 2009 ###
### v0.2-beta ###
###################
### DOWNLOAD ###
#Download the source file
wget http://www.idi.ntnu.no/~satre/akane/AkaneRE-beta.tgz
#into a folder on your Unix, Linux, Mac, or Cygwin system.
### INSTALL ###
#This creates a folder AkaneRE in the current directory,
# and puts everything there
tar -xvzf AkaneRE-beta.tgz
cd AkaneRE
make
less README.txt
### USAGE ###
#This way, you can watch the results and other debugging information in the AIMed.output.txt file
./akane script REMerge/smbm2akane/AImed.enju-ksdep.config.xml 2>AIMed.output.txt &
#On Sun-Grid:
# qrsh -nostdin ./akane script ......config.xml 2\>AIMed.output.txt &
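#A minimal sketch of what the "2>" redirection above does: only what the program writes
#to stderr ends up in the log file. The helper run_with_stderr_log is hypothetical and
#not part of AkaneRE; it also prints the end of the log if the command fails.

```shell
# Hypothetical helper, not shipped with AkaneRE.
# Usage: run_with_stderr_log LOGFILE COMMAND [ARGS...]
# Runs COMMAND with stderr redirected to LOGFILE and returns COMMAND's exit status.
# On failure, the last 20 lines of LOGFILE are echoed to the caller's stderr.
run_with_stderr_log() {
    log_file="$1"
    shift
    if "$@" 2>"$log_file"; then
        return 0
    else
        status=$?
        echo "command failed with status $status; last lines of $log_file:" >&2
        tail -n 20 "$log_file" >&2
        return "$status"
    fi
}
```

#Example, assuming akane has already been built with "make":
# run_with_stderr_log AIMed.output.txt ./akane script REMerge/smbm2akane/AImed.enju-ksdep.config.xml
#While a run is in progress, "tail -f AIMed.output.txt" in another terminal follows the log.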
### FILES ###
AkaneRE/ Main Folder, contains everything!
..Makefile type "make" to create akane executable
..README.txt you are reading the information in it right now
..HOWTO.txt TODO: things someone should take care of
..main.cpp starts everything, look here if you wonder which code to look at
..config.cpp reads the config.xml-file and sets all parameters for akane
..evaluation.cpp Does all the 10-fold calculations etc
..features.cpp extracting graph, path and Bag-Of-Word features from the parsers
..files.cpp handles the use of multiple inputfiles organized in folder structures
..namedEntities.cpp methods for dealing with custom Named Entity types
..parser.cpp coordinates the use of multiple parsers for better scores
..path.cpp supporting methods for handling parser graphs and paths
..reader.cpp reads all the SO and Parser inputfiles in different formats
..relations.cpp methods for dealing with custom Relation types
..results.cpp methods for dealing with custom Result and output types
..server.cpp support for never-ending reading from STDIO and predicting to STDOUT
..standoff.cpp general template for handling StandOff-annotated constituents, words..
..svm.cpp runs the machine learning and prediction using svm-light-GK
..syntax.cpp is a goldmine if you ever want to process Enju SO output
It changes SO into useful objects corresponding to Sentences,
Words, Part-Of-Speech Tokens, constituents, etc
..timer.cpp methods for timing different parts of the program execution
..writer.cpp prints all results and other informative files
..*.h corresponding header files with Public Methods and Constants
..REMerge/ REMerge folder, contains 5 corpora, pre-parsed by enju and gdep
...*.config.xml Configuration files for easily running the akane PPI-predictions
can be changed to do exactly what you want (without recompiling)
...*.xml One corpus in XML format
...*.so Stand-Off tags for the corpus, matching the .txt file
...*.txt Text file matching the .so file, Whole corpus in single file SO-format
...*.enju.so Parse-output from enju
...*.kenji.so Parse-output from ksdep
...*.FNpairs.txt sample output showing how error-analysis can be done during
...*.FPpairs.txt 10-fold crossvalidation.
...*.TPpairs.txt the output of these files is controlled from the config.xml file
...*.output.txt output from an actual 10-fold crossvalidation with results
svm-light-GK.zip SVMlight with GraphKernels, modified from Alessandro Moschitti
http://dit.unitn.it/~moschitt/Tree-Kernel.htm
###Third Party Software
tinyxml_2_5_3.tar.gz TinyXML from Sourceforge
http://downloads.sourceforge.net/tinyxml/tinyxml_2_5_3.tar.gz
enju-2.2.tar.gz Enju from Tsujii Laboratory at University of Tokyo
liblilfes-1.3.4.tar.gz LilFes is integrated in Enju (from Tsujii Laboratory)
StandOffManager.tar.gz a nice program to convert between XML and SO formats
http://www.idi.ntnu.no/~satre/akane/StandOffManager-0.2.tgz
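#A small sketch for checking that the archive was unpacked completely before running
#"make". The helper check_akane_layout is hypothetical and not part of AkaneRE; it only
#tests a handful of the file names listed above, so extend the list as needed.

```shell
# Hypothetical helper, not shipped with AkaneRE.
# Usage: check_akane_layout DIR
# Prints one "missing: ..." line per expected file that is absent from DIR
# and returns non-zero if any file is missing.
check_akane_layout() {
    dir="$1"
    missing=0
    # A subset of the files named in the FILES section of this README.
    for f in Makefile README.txt main.cpp config.cpp svm.cpp; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f"
            missing=1
        fi
    done
    return "$missing"
}
```

#Example:
# check_akane_layout AkaneRE && (cd AkaneRE && make)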
### RANDOM INFORMATION ###
### SVN ###
svn co file:///home/svn/satre/AkaneRE/trunk AkaneRE
### CLEANING UP ###
#If you want to get rid of the source files for Enju and LilFes, use
make clean
#If you want to get rid of the tar, tgz and zipped files too, use
make clean_source
### PARSING ###
#Enju
(I'm using the free parser "enju", 30min: http://www-tsujii.is.s.u-tokyo.ac.jp/enju/ )
cat AIMed/AIMed.nerTokenized.txt | enju -genia -so > AIMed/AIMed.nerTokenized.enju.so &
(Another good fast option is the "mogura" version of "enju", 2min)
cat AIMed/AIMed.nerTokenized.txt | mogura -genia -so > AIMed/AIMed.nerTokenized.enju.so &
#KSDep / GDep
( or the Genia Dependency parser "GDep": http://www.cs.cmu.edu/~sagae/parser/gdep/ )
cat AIMed/AIMed.nerTokenized.txt | gdep > AIMed/AIMed.nerTokenized.dep
./dep2so.prl AIMed/AIMed.nerTokenized.dep AIMed/AIMed.nerTokenized.txt > AIMed/AIMed.nerTokenized.gdep.so
#First create conll-X format input, and parse:
cat test.txt | geniatagger | ./ksdep/pos2conll.prl | ./ksdep/ksdep -m genia.mod > OUTPUTFILE.kenji.dep.txt
#Then make SO-format:
./dep2so.prl OUTPUTFILE.kenji.dep.txt test.txt > test.kenji.so.txt
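#A sketch for parsing several corpora in one go, following the naming used in the Enju
#commands above (X.txt becomes X.enju.so). Both helpers are hypothetical and not part of
#AkaneRE; the sketch assumes that "enju" is on the PATH and accepts the "-genia -so"
#options shown above.

```shell
# Hypothetical helpers, not shipped with AkaneRE.

# Usage: parser_output_name INPUT.txt PARSER
# Prints the SO file name for INPUT.txt, e.g. X.txt + enju -> X.enju.so
parser_output_name() {
    printf '%s.%s.so\n' "${1%.txt}" "$2"
}

# Usage: parse_all_with_enju FILE.txt...
# Runs enju on each input file in turn (not in the background),
# writing the result next to the input file.
parse_all_with_enju() {
    for txt in "$@"; do
        enju -genia -so < "$txt" > "$(parser_output_name "$txt" enju)"
    done
}
```

#Example:
# parse_all_with_enju AIMed/AIMed.nerTokenized.txt
#writes AIMed/AIMed.nerTokenized.enju.so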
### RUNTIME STATISTICS ###
#AIMed
Processing the AIMed corpus with 10-fold cross validation takes
- 100 seconds in linear-svm mode
- 20 minutes in tree-kernel svm mode
#BioNLP
When experimenting with the BioNLP shared task: 8000 sentences, 150 templates in 9 groups
(The Label-4 group contained 68 ambiguous templates, with 2.5 GB of training features created)
Training took 84 minutes, and memory consumption was 40 GB.
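#A minimal sketch for collecting such timings yourself. The helper report_elapsed is
#hypothetical and not part of AkaneRE; it relies on "date +%s", which is available on
#GNU, BSD and Mac systems but is not guaranteed by POSIX.

```shell
# Hypothetical helper, not shipped with AkaneRE.
# Usage: report_elapsed COMMAND [ARGS...]
# Runs COMMAND, prints the wall-clock time in whole seconds to stderr,
# and returns COMMAND's exit status.
report_elapsed() {
    start=$(date +%s)
    status=0
    "$@" || status=$?
    end=$(date +%s)
    echo "elapsed: $((end - start)) s" >&2
    return "$status"
}
```

#Example (the elapsed time lands in the same log file as the other stderr output):
# report_elapsed ./akane script REMerge/smbm2akane/AImed.enju-ksdep.config.xml 2>AIMed.output.txt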
### DEBUGGING ###
#For GnuDeBugger (gdb):
gdb akane
set args script REMerge/smbm2akane/AImed.enju-ksdep.config.xml
run
#Usage: the option "-C g" means a combination of kernels, without the graph kernel.
./akane script REMerge/smbm2akane/AImed.enju-ksdep.config.xml 2>AIMed.output.txt &
./akane script REMerge/smbm2akane/AImed.enju-ksdep.config.xml -C g 2>AIMed.output.txt &
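#A sketch of a non-interactive alternative to the gdb session above: gdb runs the
#program in batch mode and prints a backtrace if it stops on a crash. The wrapper
#debug_run_with_backtrace is hypothetical and not part of AkaneRE; it uses gdb's
#standard -batch, -ex and --args options.

```shell
# Hypothetical helper, not shipped with AkaneRE.
# Usage: debug_run_with_backtrace PROGRAM [ARGS...]
# Runs PROGRAM under gdb in batch mode ("run", then "bt") and exits gdb afterwards.
# Returns 127 if gdb is not installed.
debug_run_with_backtrace() {
    if ! command -v gdb >/dev/null 2>&1; then
        echo "gdb not found on PATH" >&2
        return 127
    fi
    gdb -batch -ex run -ex bt --args "$@"
}
```

#Example:
# debug_run_with_backtrace ./akane script REMerge/smbm2akane/AImed.enju-ksdep.config.xml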