RussianNamedEntityRecognition

Installation

git clone git@github.com:al-pacino/RussianNamedEntityRecognition.git
cd RussianNamedEntityRecognition

Download mystem (https://tech.yandex.ru/mystem/#download) and place it in the RussianNamedEntityRecognition directory
Build CRF++

tar xvf CRF++-0.58.tar
cd CRF++-0.58
./configure && make
cd ..

Build the NamedEntityRecognition program (on Windows, a Visual Studio project is available in the vs2010 folder)

./build.sh

cd scripts
unzip test-texts.zip
./test.py

A trained model file (model.crf-model) is included in the root of the repository.

After test.py finishes, each source text file in the test-texts directory should have five new companion files:

*.cp1251 - the source text converted to cp1251 encoding
*.json - the output of the mystem analyzer for the cp1251-encoded source text
*.signs - the signs (features) extracted by the main program
*.crf-tested - the result of applying crf_test to *.txt.cp1251.json.signs
*.task1 - the named entities extracted by the main program
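
As a quick sanity check after the run, the sketch below counts how many files with each of the five extensions exist in a directory. It is only an illustration: the function name `count_outputs` is hypothetical, and it assumes the output files sit directly in the directory passed in, which may not match where test.py actually writes them.

```python
from pathlib import Path

# The five output extensions listed above.
EXTENSIONS = (".cp1251", ".json", ".signs", ".crf-tested", ".task1")


def count_outputs(directory):
    """Return {extension: number of files in `directory` ending with it}."""
    names = [p.name for p in Path(directory).iterdir() if p.is_file()]
    return {ext: sum(name.endswith(ext) for name in names) for ext in EXTENSIONS}


if __name__ == "__main__":
    # Assumption: test-texts.zip was unpacked into ./test-texts.
    target = Path("test-texts")
    if target.is_dir():
        for ext, count in count_outputs(target).items():
            print(f"{ext}: {count}")
```

If every stage succeeded, all five counts should equal the number of source text files.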

Each line of a *.task1 file has the structure: TYPE OFFSET LENGTH

Where:

TYPE is one of three named entity types: ORG (organization), LOC (location), PER (person);
OFFSET is the offset in bytes from the beginning of the file;
LENGTH is the length of the text of the named entity.
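
The format above can be read back with a few lines of Python. This is a minimal sketch, not part of the repository: the names `Entity`, `parse_task1` and `entity_text` are hypothetical, the fields are assumed to be separated by whitespace, and OFFSET and LENGTH are assumed to index the *.cp1251 file, where each character occupies exactly one byte.

```python
from typing import List, NamedTuple


class Entity(NamedTuple):
    kind: str    # ORG, LOC or PER
    offset: int  # offset in bytes from the beginning of the file
    length: int  # length of the entity text


def parse_task1(text: str) -> List[Entity]:
    """Parse the contents of a *.task1 file into Entity records."""
    entities = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        kind, offset, length = line.split()
        if kind not in ("ORG", "LOC", "PER"):
            raise ValueError(f"unknown entity type: {kind}")
        entities.append(Entity(kind, int(offset), int(length)))
    return entities


def entity_text(data: bytes, entity: Entity) -> str:
    """Slice the entity out of cp1251-encoded bytes.

    Assumption: OFFSET and LENGTH refer to the *.cp1251 file, in which
    every character is a single byte.
    """
    end = entity.offset + entity.length
    return data[entity.offset:end].decode("cp1251")


if __name__ == "__main__":
    # Made-up sample data, not taken from the test texts.
    data = "Москва и Яндекс".encode("cp1251")
    for e in parse_task1("LOC 0 6\nORG 9 6\n"):
        print(e.kind, entity_text(data, e))
```

With real output, `data` would be the bytes of the matching *.cp1251 file and the argument of `parse_task1` the text of its *.task1 file.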
