IRC3 (Indexation par Recherche et Comparaison de Chaînes de Caractères, i.e. indexing by search and comparison of character strings) is a simple and robust programme that searches a corpus of text files for fixed expressions belonging to a finite list (such as chemical names, scientific names of animals or plants, author names, etc.) and extracts them.
N.B.: the list of terms and the different texts must be in UTF-8 (without BOM).
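The encoding requirement above can be checked before running the programme. The following is a minimal Python sketch, not part of IRC3 itself; the function name `check_utf8_no_bom` is hypothetical and chosen only for illustration.

```python
import codecs


def check_utf8_no_bom(data: bytes) -> bool:
    """Return True if data is valid UTF-8 and does not start with a BOM."""
    # A UTF-8 BOM is the three-byte sequence EF BB BF at the start of the file.
    if data.startswith(codecs.BOM_UTF8):
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

To check a file, read it in binary mode and pass its content, e.g. `check_utf8_no_bom(open(path, "rb").read())`, where `path` is whatever resource or text file you intend to process.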
    IRC3.pl -t table -r directory [ -e extension ]* [ -s output_file ] [ -l log ] [ -cq ]
    IRC3.pl -t table -f input_file [ -s output_file ] [ -l log ] [ -cq ]
    IRC3.pl -t table [ -s output_file ] [ -l log ] [ -cq ]
    IRC3.pl -h
    -c  takes the letter case (uppercase/lowercase) of the searched terms into account
    -e  indicates the extension (e.g. “.txt”) of the text files to process (several extensions can be given by repeating this option)
    -f  indicates the name of the input file to process
    -h  displays this help
    -l  indicates the name of the log file in which the number of found terms and occurrences is recorded
    -q  suppresses the display of the work progress (especially for use in a shell script)
    -r  indicates the directory containing the files to be processed
    -s  indicates the name of the output file
    -t  indicates the name of the file containing the resource, i.e. the list of searched terms
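When the programme is driven from another script, the command line can be assembled from the options listed above. The following Python sketch only builds the argument list; the function `build_irc3_command` and its parameter names are hypothetical, and it assumes that `-e` is only meaningful together with `-r`, as the synopsis suggests.

```python
from typing import List, Optional, Sequence


def build_irc3_command(
    table: str,
    directory: Optional[str] = None,
    input_file: Optional[str] = None,
    extensions: Sequence[str] = (),
    output_file: Optional[str] = None,
    log_file: Optional[str] = None,
    case_sensitive: bool = False,
    quiet: bool = False,
    script: str = "IRC3.pl",
) -> List[str]:
    """Build an argument list following the documented synopsis."""
    if directory is not None and input_file is not None:
        raise ValueError("use either a directory (-r) or an input file (-f)")
    if extensions and directory is None:
        raise ValueError("extensions (-e) are only used with a directory (-r)")
    cmd = [script, "-t", table]
    if directory is not None:
        cmd += ["-r", directory]
        for ext in extensions:
            cmd += ["-e", ext]  # -e may be repeated, once per extension
    elif input_file is not None:
        cmd += ["-f", input_file]
    if output_file is not None:
        cmd += ["-s", output_file]
    if log_file is not None:
        cmd += ["-l", log_file]
    if case_sensitive:
        cmd.append("-c")
    if quiet:
        cmd.append("-q")
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, assuming `IRC3.pl` is executable and on the search path.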
The resource file contains one term per line. You can indicate the preferred form of a term by adding it at the end of the line after one or more tab characters.
Empty lines and lines starting with the “#” character are ignored. Moreover, the resource may be a file compressed with gzip or bzip2.
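The resource format described above can be illustrated by a short Python reader. This is a sketch under stated assumptions, not the programme's own code: the names `open_resource` and `parse_resource` are hypothetical, compression is guessed from the file suffix, and a term without a preferred form is mapped to itself.

```python
import bz2
import gzip
from typing import Dict, IO, Iterable


def open_resource(path: str) -> IO[str]:
    """Open a plain, gzip or bzip2 resource file as UTF-8 text."""
    # Assumption: the compression type is recognised from the suffix.
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def parse_resource(lines: Iterable[str]) -> Dict[str, str]:
    """Map each term to its preferred form."""
    terms: Dict[str, str] = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Skip empty lines and comment lines starting with "#".
        if not line.strip() or line.startswith("#"):
            continue
        # Fields are separated by one or more tab characters.
        parts = [p for p in line.split("\t") if p]
        term = parts[0]
        # Assumption: without a preferred form, the term stands for itself.
        preferred = parts[-1] if len(parts) > 1 else term
        terms[term] = preferred
    return terms
```

For example, `parse_resource(open_resource(path))` would read a resource file at a given `path` into a dictionary.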
The output file contains one line per found occurrence. Each line consists of four tab-separated fields, which are, in order: