besagni on 9 Jan 2020 2 KB Added the English version of README.md
IRC3
===============

**IRC3** (_**I**ndexation par **R**echerche et **C**omparaison de **C**haînes de **C**aractères_, i.e. indexing by search and comparison of character strings) is a simple and robust program that searches a corpus of text files and extracts the fixed expressions belonging to a finite list of terms, such as chemical names, scientific names of animals or plants, or author names.

**N.B.**: the list of terms and the different texts must be in **UTF-8** (without [BOM](https://en.wikipedia.org/wiki/Byte_order_mark)). 
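The encoding requirement above can be checked before running IRC3. The following is a minimal Python sketch, not part of IRC3 itself; the function names `is_bom_free_utf8` and `strip_utf8_bom` are hypothetical helpers introduced here for illustration.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (bytes EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def is_bom_free_utf8(data: bytes) -> bool:
    """Return True if data is valid UTF-8 and does not start with a BOM."""
    if data.startswith(codecs.BOM_UTF8):
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

A file can be checked by reading it in binary mode (`open(path, "rb").read()`) and passing the bytes to `is_bom_free_utf8`.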

### Usage
```bash
    IRC3.pl -t table -r directory [ -e extension ]* [ -s output_file ] [ -l log ] [ -cq ]
    IRC3.pl -t table -f input_file [ -s output_file ] [ -l log ] [ -cq ]
    IRC3.pl -t table [ -s output_file ] [ -l log ] [ -cq ]
    IRC3.pl -h
```

### Options
```text
    -c  takes the letter case (uppercase/lowercase) of the searched 
        terms into account 
    -e  indicates the extension (e.g. “.txt”) of the text files to 
        process (repeat this option to specify several 
        extensions) 
    -f  indicates the name of the input file to process 
    -h  displays this help 
    -l  indicates the name of the log file in which the number of found
        terms and occurrences is recorded
    -q  suppresses the progress display (mainly for use in a 
        shell script) 
    -r  indicates the directory containing the files to be processed 
    -s  indicates the name of the output file  
    -t  indicates the name of the file containing the resource, i.e. the 
        list of searched terms  
```
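When IRC3 is driven from another program, the options above can be assembled programmatically. The sketch below builds the argument list in Python following the usage lines; `build_irc3_command` and all file names are hypothetical, and it assumes `IRC3.pl` is executable and on the `PATH`.

```python
from typing import List, Optional, Sequence


def build_irc3_command(
    table: str,
    directory: Optional[str] = None,
    input_file: Optional[str] = None,
    extensions: Sequence[str] = (),
    output_file: Optional[str] = None,
    log_file: Optional[str] = None,
    case_sensitive: bool = False,
    quiet: bool = False,
) -> List[str]:
    """Build an IRC3.pl argument list from the documented options."""
    if directory is not None and input_file is not None:
        raise ValueError("use either a directory (-r) or an input file (-f)")
    if extensions and directory is None:
        raise ValueError("extensions (-e) only apply with a directory (-r)")

    cmd = ["IRC3.pl", "-t", table]
    if directory is not None:
        cmd += ["-r", directory]
        for ext in extensions:  # -e may be repeated
            cmd += ["-e", ext]
    elif input_file is not None:
        cmd += ["-f", input_file]
    # With neither -r nor -f, IRC3 reads the standard input.
    if output_file is not None:
        cmd += ["-s", output_file]
    if log_file is not None:
        cmd += ["-l", log_file]
    if case_sensitive:
        cmd.append("-c")
    if quiet:
        cmd.append("-q")
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`.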

### Resource

The resource file contains one line per term. You can indicate the preferential form of a term by adding it at the end of the line, after one or more tab characters.

Empty lines and lines starting with the “#” character are ignored. The resource file may also be compressed with `gzip` or `bzip2`.
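To illustrate the resource format described above, here is a minimal Python reader. It is a sketch, not the parser used by IRC3: the names `open_resource` and `parse_resource` are hypothetical, compression is detected from the file's magic bytes (an assumption), and whitespace-only lines are treated as empty (also an assumption).

```python
import bz2
import gzip
from typing import Dict, Iterable, Optional, TextIO


def open_resource(path: str) -> TextIO:
    """Open a plain, gzip or bzip2 resource file as UTF-8 text."""
    with open(path, "rb") as fh:
        magic = fh.read(3)
    if magic[:2] == b"\x1f\x8b":  # gzip signature
        return gzip.open(path, "rt", encoding="utf-8")
    if magic == b"BZh":  # bzip2 signature
        return bz2.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def parse_resource(lines: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each term to its preferential form (None when there is none)."""
    terms: Dict[str, Optional[str]] = {}
    for line in lines:
        line = line.rstrip("\r\n")
        # Skip empty lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        # The preferential form follows one or more tab characters.
        term, _, rest = line.partition("\t")
        preferred = rest.lstrip("\t")
        terms[term] = preferred or None
    return terms
```

A resource file can then be loaded with `with open_resource(path) as fh: terms = parse_resource(fh)`.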

### Result

The output file contains one line per occurrence found. Each line consists of four tab-separated fields, in this order:

* the name of the processed file (“STDIN” for the standard input), 
* the term as it is in the resource, 
* the term as it appears in the analysed text, 
* the preferential form in the case of a synonym.
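The four fields listed above can be read back with a few lines of Python. This is a sketch under stated assumptions: `Occurrence` and `parse_output` are hypothetical names, and the fourth field is assumed to be empty or absent when the term has no preferential form.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Occurrence(NamedTuple):
    source: str                    # processed file name, or "STDIN"
    resource_term: str             # term as written in the resource
    text_form: str                 # term as it appears in the text
    preferred_form: Optional[str]  # preferential form, if any


def parse_output(lines: Iterable[str]) -> Iterator[Occurrence]:
    """Yield one Occurrence per non-empty line of an IRC3 output file."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"expected at least 3 fields, got: {line!r}")
        # Assumption: a missing fourth field means "no preferential form".
        fields += [""] * (4 - len(fields))
        source, term, found, preferred = fields[:4]
        yield Occurrence(source, term, found, preferred or None)
```

For example, `collections.Counter(o.resource_term for o in parse_output(fh))` counts the occurrences of each term in an output file opened as `fh`.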