IRC3

IRC3 (Indexation par Recherche et Comparaison de Chaînes de Caractères, i.e. indexing by search and comparison of character strings) is a simple and robust programme that searches a corpus of text files and extracts the fixed expressions (such as chemical names, scientific names of animals or plants, author names, etc.) that belong to a finite list.

N.B.: the list of terms and the texts to process must be encoded in UTF-8 (without BOM).

Usage

    IRC3.pl -t table -r directory [ -e extension ]* [ -s output_file ] [ -l log ] [ -cq ]
    IRC3.pl -t table -f input_file [ -s output_file ] [ -l log ] [ -cq ]
    IRC3.pl -t table [ -s output_file ] [ -l log ] [ -cq ]
    IRC3.pl -h

Options

    -c  takes the letter case (uppercase/lowercase) of the searched 
        terms into account 
    -e  indicates the extension (e.g. “.txt”) of the text files to 
        process (you can specify several extensions by repeating 
        this option) 
    -f  indicates the name of the input file to process 
    -h  displays this help 
    -l  indicates the name of the log file in which the number of found
        terms and occurrences is recorded
    -q  suppresses the progress display (useful when IRC3 is called 
        from a shell script) 
    -r  indicates the directory containing the files to be processed 
    -s  indicates the name of the output file  
    -t  indicates the name of the file containing the resource, i.e. the 
        list of searched terms
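
As an illustration of the first usage form, here is a minimal sketch of a directory run. All file and directory names (`terms.txt`, `corpus`, `result.tsv`, `irc3.log`) are hypothetical, and the script is only executed if `IRC3.pl` is found on the `PATH`.

```shell
# Hypothetical paths, for illustration only.
table=terms.txt
corpus=corpus
output=result.tsv
log=irc3.log

# Process the .txt and .xml files of the corpus directory, matching
# case-sensitively (-c) and without progress display (-q).
cmd="IRC3.pl -t $table -r $corpus -e .txt -e .xml -s $output -l $log -cq"
echo "$cmd"

# Run the command only when the script is actually installed.
if command -v IRC3.pl >/dev/null 2>&1; then
  $cmd
fi

# Other forms from the synopsis (not executed here):
#   IRC3.pl -t terms.txt -f article.txt -s result.tsv
#   IRC3.pl -t terms.txt -s result.tsv < article.txt
```

The last commented line assumes that the third usage form, with neither `-r` nor `-f`, reads the text from the standard input; this is inferred from the “STDIN” file name mentioned in the Result section.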

Resource

The resource file contains one term per line. You can indicate the preferential form of a term by adding it at the end of the line, after one or more tab characters.

Empty lines or lines starting with the “#” character are not considered. Moreover, the resource may be a file compressed by gzip or bzip2.
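
The following sketch builds a small resource file in the layout described above. The file name and the terms are made up for illustration; the `grep` command simply counts the lines that carry a term, i.e. those that are neither empty nor comments.

```shell
# Hypothetical file name and terms, chosen only for illustration.
resource=terms.txt
{
  printf '# plant and animal names (this line is ignored)\n'
  printf 'Quercus robur\n'
  # A synonym followed by a tab and its preferential form.
  printf 'English oak\tQuercus robur\n'
  printf '\n'
  printf 'Homo sapiens\n'
} > "$resource"

# Count the lines that are neither empty nor comments.
term_count=$(grep -c -v -e '^$' -e '^#' "$resource")
echo "$term_count terms"   # prints "3 terms"
```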

Result

The output file contains one line per found occurrence. Each line consists of four tab-separated fields, in this order:

the name of the processed file (“STDIN” for the standard input),
the term as it is in the resource,
the term as it appears in the analysed text,
the preferential form in the case of a synonym.
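
Because the output is tab-separated, it can be summarised with standard tools. The sketch below counts occurrences per preferential form, falling back to the resource term when the fourth field is empty or missing. The sample lines are made up to follow the four-field layout and are not real IRC3 output; whether IRC3 writes an empty fourth field or omits it is not stated here, and the `awk` expression handles both cases.

```shell
# Made-up sample in the four-field layout (not real IRC3 output).
printf 'doc1.txt\tEnglish oak\tenglish oak\tQuercus robur\n' >  result.tsv
printf 'doc1.txt\tQuercus robur\tQuercus robur\t\n'          >> result.tsv
printf 'doc2.txt\tHomo sapiens\tHomo sapiens\t\n'            >> result.tsv

# Key each occurrence by its preferential form (field 4) when present,
# otherwise by the term as it is in the resource (field 2).
counts=$(awk -F '\t' '
  { key = ($4 != "") ? $4 : $2; n[key]++ }
  END { for (k in n) printf "%d\t%s\n", n[k], k }
' result.tsv | sort)
echo "$counts"
```

With this sample, the two occurrences found in `doc1.txt` are both counted under “Quercus robur”.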