diff --git a/EnglishVersion.md b/EnglishVersion.md
new file mode 100644
index 0000000..c716672
--- /dev/null
+++ b/EnglishVersion.md
@@ -0,0 +1,48 @@
+IRC3
+===============
+
+**IRC3** (_**I**ndexation par **R**echerche et **C**omparaison de **C**haînes de **C**aractères_, i.e. indexing by search and comparison of character strings) is a simple and robust programme that searches a corpus of text files for fixed expressions belonging to a finite list (such as names of chemicals, scientific names of animals or plants, author names, etc.) and extracts them.
+
+**N.B.**: the list of terms and the texts to process must be encoded in **UTF-8** (without [BOM](https://fr.wikipedia.org/wiki/Indicateur_d%27ordre_des_octets)).
+
+### Usage
+```
+    IRC3.pl -t table -r directory [ -e extension ]* [ -s output_file ] [ -l log ] [ -cq ]
+    IRC3.pl -t table -f input_file [ -s output_file ] [ -l log ] [ -cq ]
+    IRC3.pl -t table [ -e extension ]* [ -s output_file ] [ -l log ] [ -cq ]
+    IRC3.pl -h
+```
+
+### Options
+```
+    -c  takes the letter case (uppercase/lowercase) of the searched
+        terms into account
+    -e  specifies the extension (e.g. “.txt”) of the text files to
+        process (you can give several extensions by repeating this
+        option)
+    -f  specifies the name of the input file to process
+    -h  displays this help
+    -l  specifies the name of the log file in which the number of found
+        terms and occurrences is recorded
+    -q  suppresses the display of work progress (especially for
+        use in a shell script)
+    -r  specifies the directory containing the files to be processed
+    -s  specifies the name of the output file
+    -t  specifies the name of the file containing the resource, i.e. the
+        list of searched terms
+```
+
+### Resource
+
+The resource file contains one line per term. You can indicate the preferred form of a term by adding it at the end of the line, after one or more tab characters.
+
+Empty lines and lines starting with the “#” character are ignored. Moreover, the resource may be a file compressed with `gzip` or `bzip2`.
+
+### Result
+
+The output file contains one line per occurrence found. Each line consists of 4 tab-separated fields, in this order:
+
+* the name of the processed file (“STDIN” for standard input),
+* the term as it is written in the resource,
+* the term as it appears in the analysed text,
+* the preferred form, if the term is a synonym.
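
The four-field output described under "Result" is easy to post-process. The following is only a minimal sketch in Python, not part of IRC3 itself: the names `Occurrence`, `read_occurrences` and `count_by_preferred` are hypothetical, the sample lines are invented for illustration, and it assumes that the fourth field may be empty or missing when a term has no preferred form (the README only says it is present "in the case of a synonym").

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class Occurrence(NamedTuple):
    """One line of the output file (hypothetical helper type)."""
    source: str     # name of the processed file, or "STDIN"
    term: str       # term as written in the resource
    matched: str    # term as it appears in the analysed text
    preferred: str  # preferred form; "" when none is given (assumption)


def read_occurrences(lines: Iterable[str]) -> Iterator[Occurrence]:
    """Split each non-empty line on tabs and pad it to 4 fields."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        fields += [""] * (4 - len(fields))  # tolerate a missing 4th field
        yield Occurrence(*fields[:4])


def count_by_preferred(occurrences: Iterable[Occurrence]) -> Counter:
    """Count occurrences, grouping synonyms under their preferred form."""
    counts: Counter = Counter()
    for occ in occurrences:
        counts[occ.preferred or occ.term] += 1
    return counts


# Invented sample data in the documented 4-field layout.
sample = (
    "doc1.txt\tescherichia coli\tEscherichia coli\t\n"
    "doc1.txt\tE. coli\tE. coli\tescherichia coli\n"
    "STDIN\tbenzene\tbenzene\n"
)
counts = count_by_preferred(read_occurrences(sample.splitlines()))
for form, n in counts.most_common():
    print(f"{n}\t{form}")
```

To use it on a real result, the sample string could be replaced by an open file handle on the file given to `-s`, opened with `encoding="utf-8"`.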