diff --git a/testvp-data-computer/README.md b/testvp-data-computer/README.md
new file mode 100644
index 0000000..47646a9
--- /dev/null
+++ b/testvp-data-computer/README.md
@@ -0,0 +1,215 @@
+# data-computer
+
+The `data-computer` instance uses the ezmaster application
+[`lodex-workers`](https://github.com/Inist-CNRS/lodex-workers).
+
+It provides several **asynchronous** services for simple data computations and transformations.
+
+*All of these services accept only standard corpus files in tar.gz format as input.*
+
+
+## Configuration
+
+The instance's configuration file must declare the following Node.js packages:
+
+- `@ezs/analytics`
+- `@ezs/basics`
+
+The latest versions are, of course, preferable.
+
+Example:
+
+```json
+{
+    "packages": [
+        "@ezs/core@3.0.5",
+        "@ezs/analytics@2.1.0",
+        "@ezs/basics@2.5.3",
+        "@ezs/spawn@1.4.4"
+    ]
+}
+```
+
+## Usage
+
+- [v1/tree-segment](#v1%2ftree-segment)
+- [v1/graph-segment](#v1%2fgraph-segment)
+- [v1/lda](#v1%2flda)
+
+### v1/tree-segment
+
+Creates sliding pairwise segments from the elements of an array and aggregates these segments to count them.
+
+Because the segments slide over adjacent elements, this treatment builds segments that represent a hierarchical tree (a minimal sketch of the logic closes this section).
+
+For example, with this input data:
+
+```json
+[
+    { "value": ["a", "b", "c"] },
+    { "value": ["a", "c", "d"] },
+    { "value": ["a", "b", "d"] },
+    { "value": ["a", "b", "c", "d"] },
+    { "value": ["a", "c", "d", "e"] }
+]
+```
+
+the output will be:
+
+```json
+[
+    {"source":"a","target":"b","weight":3,"origin":["#1","#3","#4"]},
+    {"source":"b","target":"c","weight":2,"origin":["#1","#4"]},
+    {"source":"a","target":"c","weight":2,"origin":["#2","#5"]},
+    {"source":"c","target":"d","weight":3,"origin":["#2","#4","#5"]},
+    {"source":"b","target":"d","weight":1,"origin":["#3"]},
+    {"source":"d","target":"e","weight":1,"origin":["#5"]}
+]
+```
+
+> NOTE: The service also accepts arrays of arrays (lodex/istex use case)
+
+#### URL parameter(s)
+
+| name                | description                            |
+| ------------------- | -------------------------------------- |
+| indent (true/false) | Indent the immediately returned result |
+
+#### HTTP header(s)
+
+| name   | description                                              |
+| ------ | -------------------------------------------------------- |
+| X-Hook | URL to call when the result becomes available (optional) |
+
+#### Command-line example
+
+```bash
+# Send data for batch processing
+cat input.tar.gz | curl --data-binary @- -H "X-Hook: https://webhook.site/dce2fefa-9a72-4f76-96e5-059405a04f6c" "http://localhost:31976/v1/tree-segment" > output.json
+
+# When the corpus is processed, get the result
+cat output.json | curl --data-binary @- "http://localhost:31976/v1/retrieve" > output.tar.gz
+```
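+The aggregation can be pictured with a short sketch (illustrative Python only; the actual service is an ezs pipeline, see `v1/tree-segment.ini`):
+
+```python
+# Sliding pairwise segments: count each (v[i], v[i+1]) pair across documents.
+from collections import defaultdict
+
+docs = [
+    {"value": ["a", "b", "c"]},
+    {"value": ["a", "c", "d"]},
+    {"value": ["a", "b", "d"]},
+    {"value": ["a", "b", "c", "d"]},
+    {"value": ["a", "c", "d", "e"]},
+]
+
+segments = defaultdict(list)
+for n, doc in enumerate(docs, start=1):
+    values = doc["value"]
+    # a sliding window of size 2 keeps the order, hence the tree structure
+    for source, target in zip(values, values[1:]):
+        segments[(source, target)].append(f"#{n}")
+
+result = [
+    {"source": s, "target": t, "weight": len(o), "origin": o}
+    for (s, t), o in segments.items()
+]
+# result matches the output shown above
+```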
+### v1/graph-segment
+
+Creates pairwise segments from all the elements of an array and aggregates these segments to count them.
+The segments represent every possible association, so this treatment builds segments that represent a network (a sketch of the pairing logic closes this section).
+
+For example, with this input data:
+
+```json
+[
+    { "value": ["a", "b", "c"] },
+    { "value": ["a", "c", "d"] },
+    { "value": ["a", "b", "d"] },
+    { "value": ["a", "b", "c", "d"] },
+    { "value": ["a", "c", "d", "e"] }
+]
+```
+
+the output will be:
+
+```json
+[
+    {"source":"a","target":"b","weight":3,"origin":["#1","#3","#4"]},
+    {"source":"a","target":"c","weight":4,"origin":["#1","#2","#4","#5"]},
+    {"source":"b","target":"c","weight":2,"origin":["#1","#4"]},
+    {"source":"a","target":"d","weight":4,"origin":["#2","#3","#4","#5"]},
+    {"source":"c","target":"d","weight":3,"origin":["#2","#4","#5"]},
+    {"source":"b","target":"d","weight":2,"origin":["#3","#4"]},
+    {"source":"a","target":"e","weight":1,"origin":["#5"]},
+    {"source":"c","target":"e","weight":1,"origin":["#5"]},
+    {"source":"d","target":"e","weight":1,"origin":["#5"]}
+]
+```
+
+> NOTE: The service accepts arrays or arrays of arrays
+
+#### URL parameter(s)
+
+| name                | description                            |
+| ------------------- | -------------------------------------- |
+| indent (true/false) | Indent the immediately returned result |
+
+#### HTTP header(s)
+
+| name   | description                                              |
+| ------ | -------------------------------------------------------- |
+| X-Hook | URL to call when the result becomes available (optional) |
+
+#### Command-line example
+
+```bash
+# Send data for batch processing
+cat input.tar.gz | curl --data-binary @- -H "X-Hook: https://webhook.site/dce2fefa-9a72-4f76-96e5-059405a04f6c" "http://localhost:31976/v1/graph-segment" > output.json
+
+# When the corpus is processed, get the result
+cat output.json | curl --data-binary @- "http://localhost:31976/v1/retrieve" > output.tar.gz
+```
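+The pairing differs from `tree-segment` only in that it takes every combination, not just adjacent elements; a minimal sketch (illustrative Python, assuming the values within one document are distinct, as in the example):
+
+```python
+# All unordered pairs per document: count (v[i], v[j]) for i < j.
+from collections import defaultdict
+from itertools import combinations
+
+docs = [
+    {"value": ["a", "b", "c"]},
+    {"value": ["a", "c", "d"]},
+    {"value": ["a", "b", "d"]},
+    {"value": ["a", "b", "c", "d"]},
+    {"value": ["a", "c", "d", "e"]},
+]
+
+segments = defaultdict(list)
+for n, doc in enumerate(docs, start=1):
+    for source, target in combinations(doc["value"], 2):
+        segments[(source, target)].append(f"#{n}")
+
+result = [
+    {"source": s, "target": t, "weight": len(o), "origin": o}
+    for (s, t), o in segments.items()
+]
+# result matches the output shown above
+```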
+### v1/lda
+
+From the full set of documents, creates a set of topics. Each topic contains a "words" field, the list of the 10 words that best characterize the topic, and a "weight" field, the weight of that topic in the document. The text must be in English. Marginal topics (whose probability is lower than or equal to 0.05) are not returned.
+The topics are listed in the "topics" field, and the topic with the highest probability is returned in a "best_topic" field.
+
+For example, for one document taken from a document set (the id "83" is completely arbitrary):
+
+```json
+{
+    "id":"83",
+    "value":"The current status and distribution of the red panda Ailurus fulgens in the wild is poorly known. The subspecies fulgens is found in the Himalaya in Nepal, India, Bhutan, northern Myanmar and southwest China, and the subspecies styani occurs further to the east in south-central China. The red panda is an animal of subtropical and temperate forests, with the exception of Meghalaya in India, where it is also found in tropical forests. In the wild, red pandas take a largely vegetarian diet consisting chiefly of bamboo. The extent of occurrence of the red panda in India is about 170,000 sq km, although its area of occupancy within this may only be about 25,000 sq km. An estimate based on the lowest recorded average density and the total area of potential habitat suggests that the global population of red pandas is about 16,000–20,000. Habitat loss and poaching, in that order, are the major threats. In this paper the distribution, status and conservation problems of the red panda, especially in India, are reviewed, and appropriate conservation measures recommended, including the protection of named areas and the extension of some existing protected areas."
+}
+```
+
+the output will be:
+
+```json
+{
+    "id":"83",
+    "value":{
+        "topics":{
+            "topic_6":{"words":["diet","animal","high","group","level","study","blood","dietary","intake","increase"],"weight":"0.9416929"},
+            "topic_13":{"words":["diet","intake","human","b12","food","level","protein","vitamin","increase","acid"],"weight":"0.05131816"}
+        },
+        "best_topic": {
+            "topic_6":{"words":["diet","animal","high","group","level","study","blood","dietary","intake","increase"],"weight":"0.9416929"}
+        }
+    }
+}
+```
+
+> NOTE: The quality of the results depends on the corpus, and the topics must be reviewed by the user before being used.
+
+#### URL parameter(s)
+
+| name                | description                            |
+| ------------------- | -------------------------------------- |
+| indent (true/false) | Indent the immediately returned result |
+
+#### HTTP header(s)
+
+| name   | description                                              |
+| ------ | -------------------------------------------------------- |
+| X-Hook | URL to call when the result becomes available (optional) |
+
+#### Command-line example
+
+```bash
+# Send data for batch processing
+cat input.tar.gz | curl --data-binary @- -H "X-Hook: https://webhook.site/dce2fefa-9a72-4f76-96e5-059405a04f6c" "http://localhost:31976/v1/lda" > output.json
+
+# When the corpus is processed, get the result
+cat output.json | curl --data-binary @- "http://localhost:31976/v1/retrieve" > output.tar.gz
+```
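+Reading the result back is straightforward; a minimal sketch, assuming the output shape shown above (documents without usable text come back with "value" set to "n/a"):
+
+```python
+# Extract the label, weight and first words of one document's best topic.
+doc = {
+    "id": "83",
+    "value": {
+        "best_topic": {
+            "topic_6": {
+                "words": ["diet", "animal", "high", "group", "level",
+                          "study", "blood", "dietary", "intake", "increase"],
+                "weight": "0.9416929",
+            }
+        }
+    },
+}
+
+if doc["value"] != "n/a":
+    (label, info), = doc["value"]["best_topic"].items()
+    print(label, float(info["weight"]), info["words"][:3])
+    # topic_6 0.9416929 ['diet', 'animal', 'high']
+```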
diff --git a/testvp-data-computer/example-json.tar.gz b/testvp-data-computer/example-json.tar.gz
new file mode 100644
index 0000000..cb57cc3
--- /dev/null
+++ b/testvp-data-computer/example-json.tar.gz
Binary files differ
diff --git a/testvp-data-computer/examples.http b/testvp-data-computer/examples.http
new file mode 100644
index 0000000..325dbb3
--- /dev/null
+++ b/testvp-data-computer/examples.http
@@ -0,0 +1,94 @@
+# File-global variables: defined in a region without a name or a request
+@baseUrl = http://localhost:31976
+#@baseUrl = https://data-computer.services.istex.fr/
+
+###
+# @name v1Retrieve
+# @save
+POST {{baseUrl}}/v1/retrieve HTTP/1.1
+Content-Type: application/json
+
+[
+    {
+        "value":"8RjaJDej5"
+    }
+]
+
+###
+# @name v1RetrieveJSON
+POST {{baseUrl}}/v1/retrieve-json?indent=true HTTP/1.1
+Content-Type: application/json
+
+[
+    {
+        "value":"fRDWhEhay"
+    }
+]
+
+###
+# @name v1RetrieveCSV
+POST {{baseUrl}}/v1/retrieve-csv?indent=true HTTP/1.1
+Content-Type: text/csv
+
+[
+    {
+        "value":"QjiWLMeKC"
+    }
+]
+
+###
+# @name v1baseLine
+POST {{baseUrl}}/v1/base-line HTTP/1.1
+Content-Type: application/x-tar
+X-Webhook-Success: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9
+X-Webhook-Failure: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9
+
+< ./example-json.tar.gz
+
+###
+# @name v1mockerrorsync
+POST {{baseUrl}}/v1/mock-error-sync HTTP/1.1
+Content-Type: application/x-tar
+X-Webhook-Success: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9
+X-Webhook-Failure: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9
+
+< ./example-json.tar.gz
+
+###
+# @name v1mockerrorasync
+POST {{baseUrl}}/v1/mock-error-async HTTP/1.1
+Content-Type: application/x-tar
+X-Webhook-Success: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9
+X-Webhook-Failure: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9
+
+< ./example-json.tar.gz
+
+###
+# @name v1TreeSegment
+POST {{baseUrl}}/v1/tree-segment HTTP/1.1
+Content-Type: application/x-tar
+X-Webhook-Success: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9
+X-Webhook-Failure: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9
+
+< ./example-json.tar.gz
+
+
+###
+# @name v1GraphSegment
+POST {{baseUrl}}/v1/graph-segment HTTP/1.1
+Content-Type: application/x-tar
+X-Webhook-Success: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9
+X-Webhook-Failure: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9
+
+< ./example-json.tar.gz
+
+
+###
+# @name v1Lda
+POST {{baseUrl}}/v1/lda HTTP/1.1
+Content-Type: application/x-tar
+X-Webhook-Success: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9
+X-Webhook-Failure: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9
+
+< ./example-json.tar.gz
+
+###
diff --git a/testvp-data-computer/requirements.txt b/testvp-data-computer/requirements.txt
new file mode 100755
index 0000000..b1a47bb
--- /dev/null
+++ b/testvp-data-computer/requirements.txt
@@ -0,0 +1,8 @@
+# gensim==4.3.2
+# spacy==3.6.1
+# en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0.tar.gz
+# pandas==1.4.0
+# lxml==4.7.1
+# fr_core_news_sm @ https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.6.0/fr_core_news_sm-3.6.0-py3-none-any.whl
+
+# prometheus-client==0.19.0
\ No newline at end of file
diff --git a/testvp-data-computer/swagger.json b/testvp-data-computer/swagger.json
new file mode 100644
index 0000000..b1471b2
--- /dev/null
+++ b/testvp-data-computer/swagger.json
@@ -0,0 +1,37 @@
+{
+    "openapi": "3.1.0",
+    "info": {
+        "title": "Private: lda2ldasegment",
+        "summary": "Private branch of data-computer for tests on the VP",
+        "version": "2.9.5",
+        "termsOfService": "https://services.istex.fr/",
+        "contact": {
+            "name": "Inist-CNRS",
+            "url": "https://www.inist.fr/nous-contacter/"
+        }
+    },
+    "servers": [
+        {
+            "x-comment": "Will be automatically completed by the ezs server."
+        },
+        {
+            "url": "http://vptdmjobs.intra.inist.fr:49160/",
+            "description": "Production release",
+            "x-profil": "Standard"
+        },
+        {
+            "url": "http://vitdmservices.intra.inist.fr:49313/",
+            "description": "For internal tests"
+        }
+    ],
+    "tags": [
+        {
+            "name": "data-computer",
+            "description": "Computations on a compressed corpus file",
+            "externalDocs": {
+                "description": "More documentation",
+                "url": "https://gitbucket.inist.fr/tdm/web-services/tree/master/data-computer"
+            }
+        }
+    ]
+}
diff --git a/testvp-data-computer/tests.hurl b/testvp-data-computer/tests.hurl
new file mode 100644
index 0000000..9f60dc6
--- /dev/null
+++ b/testvp-data-computer/tests.hurl
@@ -0,0 +1,40 @@
+# WARNING: This file was not generated, but manually written.
+# DON'T OVERWRITE IT
+# Use it to test:
+# npx hurl --test data-computer/tests.hurl
+
+POST https://data-computer.services.istex.fr/v1/tree-segment
+content-type: application/x-tar
+x-hook: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9
+file,example-json.tar.gz;
+
+HTTP 200
+# Capture the computing token
+[Captures]
+computing_token: jsonpath "$[0].value"
+[Asserts]
+variable "computing_token" exists
+
+# There should be a waiting time, to let the service process the data.
+# Fortunately, as the data is sparse and the computing time short,
+# a small fixed delay is enough.
+
+# Version 4.1.0 of hurl added a delay option, whose value is in milliseconds.
+# https://hurl.dev/blog/2023/09/24/announcing-hurl-4.1.0.html#add-delay-between-requests
+
+POST https://data-computer.services.istex.fr/v1/retrieve
+content-type: application/json
+[Options]
+delay: 1000
+```
+[
+    {
+        "value":"{{computing_token}}"
+    }
+]
+```
+
+HTTP 200
+Content-Type: application/x-tar
+
+# TODO: add the two other routes (v1GraphSegment, v1Lda)
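The round trip exercised by `tests.hurl` can also be scripted; a minimal sketch using only the Python standard library (the one-second wait is borrowed from the test above, not a guarantee that the result is ready; prefer the `X-Webhook-Success` header for real workloads):

```python
# Submit a corpus, wait, then fetch the result with the computing token.
import json
import time
import urllib.request

BASE_URL = "http://localhost:31976"

with open("example-json.tar.gz", "rb") as f:
    req = urllib.request.Request(
        f"{BASE_URL}/v1/tree-segment", data=f.read(),
        headers={"Content-Type": "application/x-tar"})
    # the immediate response is a JSON array holding the computing token
    token = json.load(urllib.request.urlopen(req))[0]["value"]

time.sleep(1)  # crude stand-in for the webhook notification

req = urllib.request.Request(
    f"{BASE_URL}/v1/retrieve",
    data=json.dumps([{"value": token}]).encode(),
    headers={"Content-Type": "application/json"})
with open("output.tar.gz", "wb") as out:
    out.write(urllib.request.urlopen(req).read())
```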
diff --git a/testvp-data-computer/v1/base-line.ini b/testvp-data-computer/v1/base-line.ini
new file mode 100644
index 0000000..6749270
--- /dev/null
+++ b/testvp-data-computer/v1/base-line.ini
@@ -0,0 +1,59 @@
+# Entrypoint output format
+mimeType = application/json
+
+# OpenAPI Documentation - JSON format (dot notation)
+post.operationId = post-v1-base-line
+post.summary = Loads and analyses a corpus file
+post.description = The corpus is analysed and returned without modifying the data
+post.tags.0 = data-computer
+post.requestBody.content.application/x-gzip.schema.type = string
+post.requestBody.content.application/x-gzip.schema.format = binary
+post.requestBody.content.application/x-tar.schema.type = string
+post.requestBody.content.application/x-tar.schema.format = binary
+post.requestBody.required = true
+post.responses.default.description = Information needed to retrieve the data when it is ready
+post.parameters.0.description = Indent the resulting JSON
+post.parameters.0.in = query
+post.parameters.0.name = indent
+post.parameters.0.schema.type = boolean
+post.parameters.1.description = URL called to signal that processing is finished
+post.parameters.1.in = header
+post.parameters.1.name = X-Webhook-Success
+post.parameters.1.schema.type = string
+post.parameters.1.schema.format = uri
+post.parameters.1.required = false
+post.parameters.2.description = URL called to signal that processing has failed
+post.parameters.2.in = header
+post.parameters.2.name = X-Webhook-Failure
+post.parameters.2.schema.type = string
+post.parameters.2.schema.format = uri
+post.parameters.2.required = false
+
+[env]
+path = generator
+value = base-line
+
+# Step 1 (generic): load the corpus file
+[delegate]
+file = charger.cfg
+
+# Step 2 (generic): process the received items asynchronously
+[fork]
+standalone = true
+logger = logger.cfg
+
+# Step 2.0 (optional): speeds up fork detachment when the enrichment is slow
+[fork/delegate]
+file = buffer.cfg
+
+# Step 2.1 (specific): run a computation on all received items
+[fork/transit]
+
+# Step 2.2 (generic): save the result and signal that processing is finished
+[fork/delegate]
+file = recorder.cfg
+
+# Step 3: immediately return a single element telling how to retrieve the result once it is ready
+[delegate]
+file = recipient.cfg
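The `X-Webhook-Success` / `X-Webhook-Failure` headers declared above remove the need for a fixed delay; a minimal receiver sketch (Python standard library; the payload fields mirror what recorder.cfg below sends: identifier, generator, state, plus file metadata):

```python
# Tiny HTTP endpoint accepting the JSON webhook sent when a result is ready.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        # e.g. {"identifier": "...", "generator": "base-line", "state": "ready", ...}
        print(payload.get("identifier"), payload.get("state"))
        self.send_response(200)
        self.end_headers()

HTTPServer(("", 8000), WebhookHandler).serve_forever()
```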
diff --git a/testvp-data-computer/v1/buffer.cfg b/testvp-data-computer/v1/buffer.cfg
new file mode 100644
index 0000000..b291af8
--- /dev/null
+++ b/testvp-data-computer/v1/buffer.cfg
@@ -0,0 +1,29 @@
+[use]
+plugin = basics
+
+# Save to disk so that all incoming objects are accepted quickly
+# and the client is quickly told that the asynchronous processing has started.
+#
+# The "fork" detaches only once all the objects have entered it.
+# If processing is slower than saving to disk,
+# a temporary file is therefore needed.
+[pack]
+[FILESave]
+identifier = env('identifier')
+location = /tmp/upload
+compress = true
+
+[debug]
+text = fix('Data received by', env('generator'), 'for', env('identifier')).join(' ')
+
+[exchange]
+value = get('filename')
+
+[FILELoad]
+compress = true
+location = /tmp/upload
+[unpack]
+
+[metrics]
+bucket = buffer
+
diff --git a/testvp-data-computer/v1/charger-xml.cfg b/testvp-data-computer/v1/charger-xml.cfg
new file mode 100644
index 0000000..b27b1f3
--- /dev/null
+++ b/testvp-data-computer/v1/charger-xml.cfg
@@ -0,0 +1,28 @@
+[use]
+plugin = basics
+
+# Step 0 (generic): read the standard tar.gz file
+[TARExtract]
+compress = true
+path = */*.xml
+json = false
+
+# Step 1 (generic): create a unique identifier for the received corpus
+[singleton]
+
+# Step 1.1: avoid picking up an existing uri field
+[singleton/env]
+path = pid
+value = fix(`PID${Date.now()}`)
+
+# Step 1.2: generate a unique identifier
+[singleton/identify]
+path = env('pid')
+
+# Step 1.3: keep the generated identifier in memory (in a simplified form)
+[singleton/env]
+path = identifier
+value = get(env('pid')).replace('uid:/', '')
+
+[metrics]
+bucket = charger-xml
diff --git a/testvp-data-computer/v1/charger.cfg b/testvp-data-computer/v1/charger.cfg
new file mode 100644
index 0000000..228af1d
--- /dev/null
+++ b/testvp-data-computer/v1/charger.cfg
@@ -0,0 +1,30 @@
+[use]
+plugin = basics
+
+# Step 0 (generic): read the standard tar.gz file
+[TARExtract]
+compress = true
+path = */*.json
+
+# Step 1 (generic): create a unique identifier for the received corpus
+[singleton]
+
+# Step 1.1: avoid picking up an existing uri field
+[singleton/env]
+path = pid
+value = fix(`PID${Date.now()}`)
+
+# Step 1.2: generate a unique identifier
+[singleton/identify]
+path = env('pid')
+
+# Step 1.3: keep the generated identifier in memory (in a simplified form)
+[singleton/env]
+path = identifier
+value = get(env('pid')).replace('uid:/', '')
+
+[singleton/exchange]
+value = self().omit([env('pid')])
+
+[metrics]
+bucket = charger
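Both loaders expect a gzipped tar whose entries match the `path` above (`*/*.json`, or `*/*.xml` for charger-xml.cfg); such an archive can be built with a short sketch (the `corpus/data.json` entry name and the JSON-array payload are plausible assumptions, not a published spec):

```python
# Build input.tar.gz with one JSON file inside a top-level directory.
import io
import json
import tarfile

records = [{"id": "1", "value": ["a", "b", "c"]},
           {"id": "2", "value": ["a", "c", "d"]}]
payload = json.dumps(records).encode("utf-8")

with tarfile.open("input.tar.gz", "w:gz") as tar:
    entry = tarfile.TarInfo(name="corpus/data.json")
    entry.size = len(payload)
    tar.addfile(entry, io.BytesIO(payload))
```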
diff --git a/testvp-data-computer/v1/graph-segment.ini b/testvp-data-computer/v1/graph-segment.ini
new file mode 100644
index 0000000..4132d03
--- /dev/null
+++ b/testvp-data-computer/v1/graph-segment.ini
@@ -0,0 +1,97 @@
+# Entrypoint output format
+mimeType = application/json
+
+# OpenAPI Documentation - JSON format (dot notation)
+post.operationId = post-v1-graph-segment
+post.summary = Creates segments from arrays
+post.description = The corpus is turned into a list of segments (source, target, weight) from a flat or a nested array
+post.tags.0 = data-computer
+post.requestBody.content.application/x-gzip.schema.type = string
+post.requestBody.content.application/x-gzip.schema.format = binary
+post.requestBody.content.application/x-tar.schema.type = string
+post.requestBody.content.application/x-tar.schema.format = binary
+post.requestBody.required = true
+post.responses.default.description = Information needed to retrieve the data when it is ready
+post.parameters.0.description = Indent the resulting JSON
+post.parameters.0.in = query
+post.parameters.0.name = indent
+post.parameters.0.schema.type = boolean
+post.parameters.1.description = URL called to signal that processing is finished
+post.parameters.1.in = header
+post.parameters.1.name = X-Webhook-Success
+post.parameters.1.schema.type = string
+post.parameters.1.schema.format = uri
+post.parameters.1.required = false
+post.parameters.2.description = URL called to signal that processing has failed
+post.parameters.2.in = header
+post.parameters.2.name = X-Webhook-Failure
+post.parameters.2.schema.type = string
+post.parameters.2.schema.format = uri
+post.parameters.2.required = false
+
+[env]
+path = generator
+value = graph-segment
+
+[use]
+plugin = basics
+plugin = analytics
+
+# Step 1 (generic): load the corpus file
+[delegate]
+file = charger.cfg
+
+# Step 1.1 (specific): validate the first element, assuming the others look like it
+[singleton]
+[singleton/validate]
+path = id
+rule = required
+
+path = value
+rule = required|array
+
+# Step 2 (generic): process the received items asynchronously
+[fork]
+standalone = true
+logger = logger.cfg
+
+# Step 2.1 (specific): run a computation on all received items
+[fork/delegate]
+
+# Step 2.1.1 (specific): make sure we have a flat array
+[fork/delegate/replace]
+path = id
+value = get('id')
+path = value
+value = get('value').thru(x => x && Array.isArray(x[0])?x:[x]).flatten().filter(Boolean)
+
+# Step 2.1.2 (specific): create arrays of segment pairs (bigrams)
+[fork/delegate/graph]
+path = value
+identifier = id
+
+# Step 2.1.3 (specific): group the segments
+[fork/delegate/aggregate]
+
+# Step 2.1.4 (specific): build the specific result of the computation
+[fork/delegate/replace]
+path = source
+value = get('id.0')
+path = target
+value = get('id.1')
+path = weight
+value = get('value').size()
+path = origin
+value = get('value').uniq()
+
+[fork/transit]
+
+# Step 2.2 (generic): save the result and signal that processing is finished
+[fork/delegate]
+file = recorder.cfg
+
+# Step 3: immediately return a single element telling how to retrieve the result once it is ready
+[delegate]
+file = recipient.cfg
diff --git a/testvp-data-computer/v1/lda-segment.cfg b/testvp-data-computer/v1/lda-segment.cfg
new file mode 100644
index 0000000..fcdfde1
--- /dev/null
+++ b/testvp-data-computer/v1/lda-segment.cfg
@@ -0,0 +1,40 @@
+[use]
+plugin = basics
+plugin = analytics
+plugin = spawn
+
+#
+# Step 2.1 (specific): run a computation on all received items
+[exec]
+# the command must be executable!
+command = ./v1/lda.py
+
+# Step 2.1.1 (specific): reshape the output for simpler use in Lodex
+[exchange]
+value = get('value.topics').map((o, i) => _.zip(o.words, o.words_weights).map(x=>x.concat(i).concat(o.topic_weight))).flatten().map(y=> ({source: y[0], target: y[2], weight: y[1], origin: self.id})).filter(Boolean)
+[ungroup]
+
+# Step 2.1.2 (specific): group the origin values by identical segment
+[replace]
+path = id
+value = self().omit(['uuid', 'origin'])
+path = value
+value = get('origin')
+
+[aggregate]
+path = value
+
+[replace]
+path = source
+value = get('id.source')
+
+path = target
+value = get('id.target')
+
+path = weight
+value = get('id.weight')
+
+path = origin
+value = get('value')
+
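The `exchange` expression above is dense; in Python it does roughly the following (illustrative sketch of the reshaping, using the field names produced by lda.py):

```python
# Turn one document's topics into (word -> topic) segments for Lodex.
def topics_to_segments(doc):
    segments = []
    for label, topic in doc["value"]["topics"].items():
        for word, word_weight in zip(topic["words"], topic["words_weights"]):
            segments.append({
                "source": word,         # y[0]: the word
                "target": label,        # y[2]: the topic label
                "weight": word_weight,  # y[1]: the word's weight in the topic
                "origin": doc["id"],
            })
    return segments
```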
diff --git a/testvp-data-computer/v1/lda-segment.ini b/testvp-data-computer/v1/lda-segment.ini
new file mode 100644
index 0000000..f86a7a2
--- /dev/null
+++ b/testvp-data-computer/v1/lda-segment.ini
@@ -0,0 +1,69 @@
+# Entrypoint output format
+mimeType = application/json
+
+# OpenAPI Documentation - JSON format (dot notation)
+post.operationId = post-v1-lda-segment
+post.summary = Classifies a set of documents into 5 topics and creates segments between the words and the topics.
+post.description = From the full set of documents, creates an `lda` field made of 5 topics, each characterized by 10 words.
+post.tags.0 = data-computer
+post.requestBody.content.application/x-gzip.schema.type = string
+post.requestBody.content.application/x-gzip.schema.format = binary
+post.requestBody.content.application/x-tar.schema.type = string
+post.requestBody.content.application/x-tar.schema.format = binary
+post.requestBody.required = true
+post.responses.default.description = Information needed to retrieve the data when it is ready
+post.parameters.0.description = Indent the resulting JSON
+post.parameters.0.in = query
+post.parameters.0.name = indent
+post.parameters.0.schema.type = boolean
+post.parameters.1.description = URL called to signal that processing is finished
+post.parameters.1.in = header
+post.parameters.1.name = X-Webhook-Success
+post.parameters.1.schema.type = string
+post.parameters.1.schema.format = uri
+post.parameters.1.required = false
+post.parameters.2.description = URL called to signal that processing has failed
+post.parameters.2.in = header
+post.parameters.2.name = X-Webhook-Failure
+post.parameters.2.schema.type = string
+post.parameters.2.schema.format = uri
+post.parameters.2.required = false
+
+[env]
+path = generator
+value = lda
+
+# Step 1 (generic): load the corpus file
+[delegate]
+file = charger.cfg
+
+# Step 1.1 (specific): validate the first element, assuming the others look like it
+[singleton]
+[singleton/validate]
+path = id
+rule = required
+
+path = value
+rule = required|string
+
+# Step 2 (generic): process the received items asynchronously
+[fork]
+standalone = true
+logger = logger.cfg
+
+# Step 2.0 (optional): speeds up fork detachment when the enrichment is slow
+[fork/delegate]
+file = buffer.cfg
+
+# Step 2.1 (specific): run a computation on all received items
+[fork/delegate]
+file = lda-segment.cfg
+
+# Step 2.2 (generic): save the result and signal that processing is finished
+[fork/delegate]
+file = recorder.cfg
+
+# Step 3: immediately return a single element telling how to retrieve the result once it is ready
+[delegate]
+file = recipient.cfg
+
diff --git a/testvp-data-computer/v1/lda.ini b/testvp-data-computer/v1/lda.ini
new file mode 100644
index 0000000..5fe3822
--- /dev/null
+++ b/testvp-data-computer/v1/lda.ini
@@ -0,0 +1,62 @@
+# Entrypoint output format
+mimeType = application/json
+
+# OpenAPI Documentation - JSON format (dot notation)
+post.operationId = post-v1-lda
+post.summary = Classifies a set of documents into 5 topics.
+post.description = From the full set of documents, creates an `lda` field made of 5 _topics_, each characterized by 10 words. **Note**: the text must be in English. **Note 2**: the quality of the results depends on the corpus, and the _topics_ must be reviewed by the user before being used.
+post.tags.0 = data-computer
+post.requestBody.content.application/x-gzip.schema.type = string
+post.requestBody.content.application/x-gzip.schema.format = binary
+post.requestBody.content.application/x-tar.schema.type = string
+post.requestBody.content.application/x-tar.schema.format = binary
+post.requestBody.required = true
+post.responses.default.description = Information needed to retrieve the data when it is ready
+post.parameters.0.description = Indent the resulting JSON
+post.parameters.0.in = query
+post.parameters.0.name = indent
+post.parameters.0.schema.type = boolean
+post.parameters.1.description = URL called to signal that processing is finished
+post.parameters.1.in = header
+post.parameters.1.name = X-Webhook-Success
+post.parameters.1.schema.type = string
+post.parameters.1.schema.format = uri
+post.parameters.1.required = false
+post.parameters.2.description = URL called to signal that processing has failed
+post.parameters.2.in = header
+post.parameters.2.name = X-Webhook-Failure
+post.parameters.2.schema.type = string
+post.parameters.2.schema.format = uri
+post.parameters.2.required = false
+
+[env]
+path = generator
+value = lda
+
+[use]
+plugin = basics
+plugin = analytics
+plugin = spawn
+
+# Step 1 (generic): load the corpus file
+[delegate]
+file = charger.cfg
+
+# Step 2 (generic): process the received items asynchronously
+[fork]
+standalone = true
+logger = logger.cfg
+
+# Step 2.1 (specific): run a computation on all received items
+[fork/exec]
+# the command must be executable!
+command = ./v1/lda.py
+
+# Step 2.2 (generic): save the result and signal that processing is finished
+[fork/delegate]
+file = recorder.cfg
+
+# Step 3: immediately return a single element telling how to retrieve the result once it is ready
+[delegate]
+file = recipient.cfg
+
diff --git a/testvp-data-computer/v1/lda.py b/testvp-data-computer/v1/lda.py
new file mode 100755
index 0000000..85e4364
--- /dev/null
+++ b/testvp-data-computer/v1/lda.py
@@ -0,0 +1,155 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+import json
+import sys
+from gensim import corpora, models
+import unicodedata
+import re
+import spacy
+from prometheus_client import CollectorRegistry, Counter, push_to_gateway
+
+registry = CollectorRegistry()
+c = Counter('documents', 'Number of documents processed', registry=registry)
+job_name = 'lda'
+
+nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
+
+# stopwords
+with open('./v1/stopwords/en.json', 'r') as f_in:
+    stopwords = json.load(f_in)
+
+# normalize text
+def remove_accents(text):
+    if text == "" or not isinstance(text, str):
+        return ""
+    normalized_text = unicodedata.normalize("NFD", text)
+    return re.sub("[\u0300-\u036f]", "", normalized_text)
+
+def uniformize(text):
+    text = remove_accents(text)
+    # remove punctuation except "'"
+    text = ''.join(char if char.isalpha() or char == "'" else ' ' for char in text)
+    return ' '.join(text.lower().split())
+
+# lemmatize
+def lemmatize(text):
+    if text == "":
+        return text
+    doc = nlp(text)
+    return " ".join([token.lemma_ for token in doc])
+
+# tokenize
+def tokenize(text):
+    return [word for word in text.replace("'", " ").split()
+            if word not in stopwords and len(word) > 2]
+
+# Max topic
+def max_topic(dico):
+    """Return a single-entry dict {label: topic} for the topic of `dico`
+    with the highest topic_weight."""
+    best_topic = None
+    best_proba = 0
+    for topic in dico:
+        proba = float(dico[topic]["topic_weight"])
+        if proba > best_proba:
+            best_proba = proba
+            best_topic = topic
+    return {best_topic: dico[best_topic]}
+
+
+# WS
+# load all the data (one JSON document per stdin line)
+all_data = []
+for line in sys.stdin:
+    all_data.append(json.loads(line))
+
+# the following parameters depend on the size of the corpus: num_topics and num_iterations
+len_data = len(all_data)
+if len_data < 1001:
+    num_topics = 10
+    num_iterations = 150
+elif len_data < 20001:
+    num_topics = 15
+    num_iterations = 200
+else:
+    num_topics = 20
+    num_iterations = 250
+
+# training LDA
+texts = []
+index_without_value = []
+for i in range(len_data):
+    line = all_data[i]
+    if "value" in line and isinstance(line["value"], str):
+        tokens = tokenize(lemmatize(uniformize(line["value"])))
+        if tokens:
+            texts.append(tokens)
+        else:
+            index_without_value.append(i)
+    else:
+        index_without_value.append(i)
+
+# Build a term dictionary; doc2bow then replaces each token by an id:
+# [[(token_id, token_count), ...], ...], one list per document of the corpus
+dictionary = corpora.Dictionary(texts)
+dictionary.filter_extremes(no_below=3, no_above=0.5)
+corpus = [dictionary.doc2bow(text) for text in texts]
+
+try:
+    lda_model = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary,
+                                iterations=num_iterations, alpha="symmetric",
+                                eta="auto", minimum_probability=0.1)
+except Exception:
+    # training failed: flag every document as unprocessed
+    index_without_value = list(range(len_data))
+
+
+# extract infos
+for i in range(len_data):
+    line = all_data[i]
+
+    # return n/a if the document was not in the model
+    if i in index_without_value:
+        line["value"] = "n/a"
+        sys.stdout.write(json.dumps(line))
+        sys.stdout.write("\n")
+    else:
+        c.inc()
+        push_to_gateway('jobs-metrics.daf.intra.inist.fr', job=job_name, registry=registry)
+
+        # use the same preprocessing as for training (lemmatize included)
+        doc_bow = dictionary.doc2bow(tokenize(lemmatize(uniformize(line["value"]))))
+        topics = lda_model[doc_bow]
+        topic_info = {}
+        for topic_id, topic_weight in topics:
+            words = []
+            words_weights = []
+            for word, word_weight in lda_model.show_topic(topic_id):
+                words.append(word)
+                words_weights.append(str(word_weight))
+            topic_info[f"topic_{topic_id + 1}"] = {
+                "words": words,
+                "words_weights": words_weights,
+                "topic_weight": str(topic_weight),
+            }
+
+        line["value"] = {"topics": topic_info}
+        try:
+            line["value"]["best_topic"] = max_topic(topic_info)
+        except Exception:
+            line["value"]["best_topic"] = "n/a"
+        sys.stdout.write(json.dumps(line))
+        sys.stdout.write("\n")
+
+
+# #To see topics (to test it with a jsonl file)
+# sys.stdout.write(json.dumps(lda_model.print_topics()))
+
+# #Get coherence
+# cm = models.coherencemodel.CoherenceModel(model=lda_model, texts=texts, coherence='c_v')
+# cm.get_coherence()
+# exit()
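Since lda.py speaks JSON Lines on stdin/stdout, it can be exercised outside the pipeline; a minimal smoke test (hypothetical corpus; assumes the dependencies listed in requirements.txt are installed and the command is run from the service root so that ./v1/stopwords/en.json resolves):

```python
# Feed a tiny artificial corpus to lda.py and print what comes back.
import json
import subprocess

samples = ["red panda bamboo forest habitat himalaya",
           "protein vitamin diet intake blood level"]
docs = [{"id": str(i), "value": samples[i % 2]} for i in range(30)]
payload = "\n".join(json.dumps(d) for d in docs)

out = subprocess.run(["./v1/lda.py"], input=payload,
                     capture_output=True, text=True, check=True)
for line in out.stdout.splitlines():
    record = json.loads(line)
    value = record["value"]
    print(record["id"], value if value == "n/a" else list(value["best_topic"]))
```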
diff --git a/testvp-data-computer/v1/lda2ldasegment.cfg b/testvp-data-computer/v1/lda2ldasegment.cfg
new file mode 100644
index 0000000..5e8b62d
--- /dev/null
+++ b/testvp-data-computer/v1/lda2ldasegment.cfg
@@ -0,0 +1,36 @@
+[use]
+plugin = basics
+plugin = analytics
+plugin = spawn
+
+#
+# Step 2.1 (specific): run a computation on all received items
+# Step 2.1.1 (specific): reshape the output for simpler use in Lodex
+[exchange]
+value = get('value.topics').map((o, i) => _.zip(o.words, o.words_weights).map(x=>x.concat(i).concat(o.topic_weight))).flatten().map(y=> ({source: y[0], target: y[2], weight: y[1], origin: self.id})).filter(Boolean)
+[ungroup]
+
+# Step 2.1.2 (specific): group the origin values by identical segment
+[replace]
+path = id
+value = self().omit(['uuid', 'origin'])
+path = value
+value = get('origin')
+
+[aggregate]
+path = value
+
+[replace]
+path = source
+value = get('id.source')
+
+path = target
+value = get('id.target')
+
+path = weight
+value = get('id.weight')
+
+path = origin
+value = get('value')
+
diff --git a/testvp-data-computer/v1/lda2ldasegment.ini b/testvp-data-computer/v1/lda2ldasegment.ini
new file mode 100644
index 0000000..5edb94b
--- /dev/null
+++ b/testvp-data-computer/v1/lda2ldasegment.ini
@@ -0,0 +1,69 @@
+# Entrypoint output format
+mimeType = application/json
+
+# OpenAPI Documentation - JSON format (dot notation)
+post.operationId = post-v1-lda2ldasegment
+post.summary = Do not use: reformats lda data into lda-segment data.
+post.description = Do not use: lda2ldasegment reformatting.
+post.tags.0 = data-computer
+post.requestBody.content.application/x-gzip.schema.type = string
+post.requestBody.content.application/x-gzip.schema.format = binary
+post.requestBody.content.application/x-tar.schema.type = string
+post.requestBody.content.application/x-tar.schema.format = binary
+post.requestBody.required = true
+post.responses.default.description = Information needed to retrieve the data when it is ready
+post.parameters.0.description = Indent the resulting JSON
+post.parameters.0.in = query
+post.parameters.0.name = indent
+post.parameters.0.schema.type = boolean
+post.parameters.1.description = URL called to signal that processing is finished
+post.parameters.1.in = header
+post.parameters.1.name = X-Webhook-Success
+post.parameters.1.schema.type = string
+post.parameters.1.schema.format = uri
+post.parameters.1.required = false
+post.parameters.2.description = URL called to signal that processing has failed
+post.parameters.2.in = header
+post.parameters.2.name = X-Webhook-Failure
+post.parameters.2.schema.type = string
+post.parameters.2.schema.format = uri
+post.parameters.2.required = false
+
+[env]
+path = generator
+value = lda
+
+# Step 1 (generic): load the corpus file
+[delegate]
+file = charger.cfg
+
+# Step 1.1 (specific): validate the first element, assuming the others look like it
+[singleton]
+[singleton/validate]
+path = id
+rule = required
+
+path = value
+rule = required
+
+# Step 2 (generic): process the received items asynchronously
+[fork]
+standalone = true
+logger = logger.cfg
+
+# Step 2.0 (optional): speeds up fork detachment when the enrichment is slow
+[fork/delegate]
+file = buffer.cfg
+
+# Step 2.1 (specific): run a computation on all received items
+[fork/delegate]
+file = lda2ldasegment.cfg
+
+# Step 2.2 (generic): save the result and signal that processing is finished
+[fork/delegate]
+file = recorder.cfg
+
+# Step 3: immediately return a single element telling how to retrieve the result once it is ready
+[delegate]
+file = recipient.cfg
+
diff --git a/testvp-data-computer/v1/logger.cfg b/testvp-data-computer/v1/logger.cfg
new file mode 100644
index 0000000..6e44e2d
--- /dev/null
+++ b/testvp-data-computer/v1/logger.cfg
@@ -0,0 +1,55 @@
+; [use]
+plugin = basics
+plugin = analytics
+
+[metrics]
+bucket = logger
+
+# Keep only the first error raised
+[shift]
+
+[debug]
+text = Error trapped
+
+[assign]
+path = body.identifier
+value = env('identifier')
+
+path = body.generator
+value = env('generator')
+
+path = body.error.type
+value = get('type')
+
+path = body.error.scope
+value = get('scope')
+
+path = body.error.message
+value = get('message')
+
+path = env
+value = env()
+
+[swing]
+test = env('headers.x-webhook-failure').startsWith('http')
+
+[swing/URLFetch]
+url = env('headers.x-webhook-failure').trim()
+path = body
+headers = Content-Type:application/json
+target = result
+retries = 5
+timeout = 30000
+
+# Only a few fields are recorded (remove this step to keep the full trace)
+[exchange]
+value = get('body')
+
+[FILESave]
+location = /tmp/retrieve
+identifier = env('identifier')
+jsonl = true
+compress = false
+
+[debug]
+text = Error was saved
diff --git a/testvp-data-computer/v1/mock-error-async.ini b/testvp-data-computer/v1/mock-error-async.ini
new file mode 100644
index 0000000..d32279a
--- /dev/null
+++ b/testvp-data-computer/v1/mock-error-async.ini
@@ -0,0 +1,60 @@
+# Entrypoint output format
+mimeType = application/json
+
+# OpenAPI Documentation - JSON format (dot notation)
+post.operationId = post-v1-mock-error-async
+post.summary = Simulates an asynchronous error, raised after processing has started.
+post.description = Loads and analyses a corpus file, simulating an error during processing
+post.tags.0 = data-computer
+post.requestBody.content.application/x-gzip.schema.type = string
+post.requestBody.content.application/x-gzip.schema.format = binary
+post.requestBody.content.application/x-tar.schema.type = string
+post.requestBody.content.application/x-tar.schema.format = binary
+post.requestBody.required = true
+post.responses.default.description = Information needed to retrieve the data when it is ready
+post.parameters.0.description = Indent the resulting JSON
+post.parameters.0.in = query
+post.parameters.0.name = indent
+post.parameters.0.schema.type = boolean
+post.parameters.1.description = URL called to signal that processing is finished
+post.parameters.1.in = header
+post.parameters.1.name = X-Webhook-Success
+post.parameters.1.schema.type = string
+post.parameters.1.schema.format = uri
+post.parameters.1.required = false
+post.parameters.2.description = URL called to signal that processing has failed
+post.parameters.2.in = header
+post.parameters.2.name = X-Webhook-Failure
+post.parameters.2.schema.type = string
+post.parameters.2.schema.format = uri
+post.parameters.2.required = false
+
+[env]
+path = generator
+value = mock-error-async
+
+[use]
+plugin = basics
+plugin = analytics
+
+# Step 1 (generic): load the corpus file
+[delegate]
+file = charger.cfg
+
+# Step 2 (generic): process the received items asynchronously
+[fork]
+standalone = true
+logger = logger.cfg
+
+# Step 2.1 (specific): run a computation on all received items (fails on purpose)
+[fork/validate]
+path = fake path
+rule = required
+
+# Step 2.2 (generic): save the result and signal that processing is finished
+[fork/delegate]
+file = recorder.cfg
+
+# Step 3: immediately return a single element telling how to retrieve the result once it is ready
+[delegate]
+file = recipient.cfg
diff --git a/testvp-data-computer/v1/mock-error-sync.ini b/testvp-data-computer/v1/mock-error-sync.ini
new file mode 100644
index 0000000..bd8bab1
--- /dev/null
+++ b/testvp-data-computer/v1/mock-error-sync.ini
@@ -0,0 +1,52 @@
+# Entrypoint output format
+mimeType = application/json
+
+# OpenAPI Documentation - JSON format (dot notation)
+post.operationId = post-v1-mock-error-sync
+post.summary = Simulates a synchronous error, raised before processing starts.
+post.description = Loads and analyses a corpus file, simulating an immediate error while analysing the file
+post.tags.0 = data-computer
+post.requestBody.content.application/x-gzip.schema.type = string
+post.requestBody.content.application/x-gzip.schema.format = binary
+post.requestBody.content.application/x-tar.schema.type = string
+post.requestBody.content.application/x-tar.schema.format = binary
+post.requestBody.required = true
+post.responses.default.description = Information needed to retrieve the data when it is ready
+post.parameters.0.description = Indent the resulting JSON
+post.parameters.0.in = query
+post.parameters.0.name = indent
+post.parameters.0.schema.type = boolean
+post.parameters.1.description = URL called to signal that processing is finished
+post.parameters.1.in = header
+post.parameters.1.name = X-Webhook-Success
+post.parameters.1.schema.type = string
+post.parameters.1.schema.format = uri
+post.parameters.1.required = false
+post.parameters.2.description = URL called to signal that processing has failed
+post.parameters.2.in = header
+post.parameters.2.name = X-Webhook-Failure
+post.parameters.2.schema.type = string
+post.parameters.2.schema.format = uri
+post.parameters.2.required = false
+
+[env]
+path = generator
+value = mock-error-sync
+
+[use]
+plugin = basics
+plugin = analytics
+
+# Step 1 (generic): load the corpus file
+[delegate]
+file = charger.cfg
+
+# Step 2 (specific): fail synchronously while validating the received items
+[validate]
+path = fake path
+rule = required
+
+# Step 3: immediately return a single element telling how to retrieve the result once it is ready
+[delegate]
+file = recipient.cfg
+
diff --git a/testvp-data-computer/v1/pair-segment.ini b/testvp-data-computer/v1/pair-segment.ini
new file mode 100644
index 0000000..ccc73fa
--- /dev/null
+++ b/testvp-data-computer/v1/pair-segment.ini
@@ -0,0 +1,97 @@
+# Entrypoint output format
+mimeType = application/json
+
+# OpenAPI Documentation - JSON format (dot notation)
+post.operationId = post-v1-pair-segment
+post.summary = Creates segments from the co-occurrences of the values of several fields
+post.description = The corpus is turned into a list of segments (source, target, weight) from the co-occurrences of several fields
+post.tags.0 = data-computer
+post.requestBody.content.application/x-gzip.schema.type = string
+post.requestBody.content.application/x-gzip.schema.format = binary
+post.requestBody.content.application/x-tar.schema.type = string
+post.requestBody.content.application/x-tar.schema.format = binary
+post.requestBody.required = true
+post.responses.default.description = Information needed to retrieve the data when it is ready
+post.parameters.0.description = Indent the resulting JSON
+post.parameters.0.in = query
+post.parameters.0.name = indent
+post.parameters.0.schema.type = boolean
+post.parameters.1.description = URL called to signal that processing is finished
+post.parameters.1.in = header
+post.parameters.1.name = X-Webhook-Success
+post.parameters.1.schema.type = string
+post.parameters.1.schema.format = uri
+post.parameters.1.required = false
+post.parameters.2.description = URL called to signal that processing has failed
+post.parameters.2.in = header
+post.parameters.2.name = X-Webhook-Failure
+post.parameters.2.schema.type = string
+post.parameters.2.schema.format = uri
+post.parameters.2.required = false
+
+[env]
+path = generator
+value = pair-segment
+
+[use]
+plugin = basics
+plugin = analytics
+
+# Step 1 (generic): load the corpus file
+[delegate]
+file = charger.cfg
+
+# Step 1.1 (specific): validate the first element, assuming the others look like it
+[singleton]
+[singleton/validate]
+path = id
+rule = required
+
+path = value
+rule = required|array
+
+# Step 2 (generic): process the received items asynchronously
+[fork]
+standalone = true
+logger = logger.cfg
+
+# Step 2.1 (specific): run a computation on all received items
+[fork/delegate]
+
+# Step 2.1.1 (specific): make sure we have arrays of arrays
+; [fork/delegate/replace]
+; path = id
+; value = get('id')
+; path = value
+; value = get('value').castArray().map(x => x && Array.isArray(x)?x:[x])
+
+# Step 2.1.2 (specific): create arrays of co-occurrences
+
+[fork/delegate/pair]
+path = get('value').map((val, idx) => `value.${idx}`)
+identifier = id
+
+# Step 2.1.3 (specific): group the segments
+[fork/delegate/aggregate]
+
+# Step 2.1.4 (specific): build the specific result of the computation
+[fork/delegate/replace]
+path = source
+value = get('id.0')
+path = target
+value = get('id.1')
+path = weight
+value = get('value').size()
+path = origin
+value = get('value').uniq()
+
+[fork/transit]
+
+# Step 2.2 (generic): save the result and signal that processing is finished
+[fork/delegate]
+file = recorder.cfg
+
+# Step 3: immediately return a single element telling how to retrieve the result once it is ready
+[delegate]
+file = recipient.cfg
+
diff --git a/testvp-data-computer/v1/recipient.cfg b/testvp-data-computer/v1/recipient.cfg
new file mode 100644
index 0000000..f723830
--- /dev/null
+++ b/testvp-data-computer/v1/recipient.cfg
@@ -0,0 +1,12 @@
+[use]
+plugin = basics
+
+[shift]
+[replace]
+path = id
+value = env('generator')
+path = value
+value = env('identifier')
+
+[JSONString]
+indent = env('indent')
diff --git a/testvp-data-computer/v1/recorder.cfg b/testvp-data-computer/v1/recorder.cfg
new file mode 100644
index 0000000..b62e743
--- /dev/null
+++ b/testvp-data-computer/v1/recorder.cfg
@@ -0,0 +1,56 @@
+[use]
+plugin = basics
+plugin = analytics
+
+[singleton]
+[singleton/debug]
+text = fix('One first result received by', env('generator'), 'for', env('identifier')).join(' ')
+
+[metrics]
+bucket = recorder
+
+# Step 2.2 (generic): create a standard result file
+[TARDump]
+compress = true
+manifest = fix({version: '1'})
+manifest = fix({identifier: env('identifier')})
+manifest = fix({generator: env('generator')})
+
+# Step 2.3 (generic): save the result to disk
+[FILESave]
+location = /tmp/retrieve
+identifier = env('identifier')
+jsonl = false
+compress = false
+
+# Step 2.4 (generic): signal the end of processing by calling a webhook (if one was given)
+[swing]
+test = env('headers.x-webhook-success').startsWith('http')
+
+# Step 2.4.1 (generic): select the information to send to the webhook
+[swing/replace]
+path = url
+value = env('headers.x-webhook-success')
+path = body
+value = self().pick(['size', 'atime', 'mtime', 'ctime']).set('identifier', env('identifier')).set('generator', env('generator')).set('state', 'ready')
+
+[swing/debug]
+text = fix('Result generated by', env('generator'), 'for', env('identifier')).join(' ')
+
+# Step 2.4.2 (generic): send the HTTP request
+[swing/URLFetch]
+url = env('headers.x-webhook-success').trim()
+path = body
+headers = Content-Type:application/json
+retries = 5
+timeout = 30000
+
+# Step 2.4.3 (optional): add a trace to the log
+[swing/debug]
+text = fix('WebHook triggered by', env('generator'), 'for', env('identifier')).join(' ')
+
+# Step 2.5 (optional): add a trace to the log
+[debug]
+text = fix('Process completed by', env('generator'), 'for', env('identifier')).join(' ')
+
+
diff --git a/testvp-data-computer/v1/retrieve-csv.ini b/testvp-data-computer/v1/retrieve-csv.ini
new file mode 100644
index 0000000..2d4cd4c
--- /dev/null
+++ b/testvp-data-computer/v1/retrieve-csv.ini
@@ -0,0 +1,35 @@
+# Entrypoint output format
+mimeType = text/csv
+
+# OpenAPI Documentation - JSON format (dot notation)
+post.operationId = post-v1-retrieve-csv
+post.summary = Retrieves a produced result as a CSV stream
+post.description = Since processing is asynchronous, the result, once created, must be retrieved through this route
+post.tags.0 = data-computer
+post.responses.default.description = Corpus file in CSV form
+post.requestBody.content.application/json.example.0.value = xMkWJX7GU
+post.requestBody.content.application/json.schema.$ref = #/components/schemas/JSONStream
+post.requestBody.required = true
+
+[use]
+plugin = basics
+
+[JSONParse]
+separator = *
+
+[exchange]
+value = get('value')
+
+[FILELoad]
+location = /tmp/retrieve
+
+[TARExtract]
+compress = true
+path = */*.json
+
+# Flatten nested objects so that every CSV cell is a plain string
+[exchange]
+value = self().mapValues(value => typeof value === 'object' ? JSON.stringify(value) : value)
+
+[CSVString]
+separator = fix(',')
+format = strict
diff --git a/testvp-data-computer/v1/retrieve-json.ini b/testvp-data-computer/v1/retrieve-json.ini
new file mode 100644
index 0000000..5ae55b1
--- /dev/null
+++ b/testvp-data-computer/v1/retrieve-json.ini
@@ -0,0 +1,36 @@
+# Entrypoint output format
+mimeType = application/json
+
+# OpenAPI Documentation - JSON format (dot notation)
+post.operationId = post-v1-retrieve-json
+post.summary = Retrieves a produced result as a JSON stream
+post.description = Since processing is asynchronous, the result, once created, must be retrieved through this route
+post.tags.0 = data-computer
+post.responses.default.description = Corpus file in JSON format
+post.requestBody.content.application/json.example.0.value = xMkWJX7GU
+post.requestBody.content.application/json.schema.$ref = #/components/schemas/JSONStream
+post.requestBody.required = true
+post.parameters.0.description = Indent the resulting JSON
+post.parameters.0.in = query
+post.parameters.0.name = indent
+post.parameters.0.schema.type = boolean
+
+[use]
+plugin = basics
+
+[JSONParse]
+separator = *
+
+[exchange]
+value = get('value')
+
+[FILELoad]
+location = /tmp/retrieve
+
+[TARExtract]
+compress = true
+path = */*.json
+
+[JSONString]
+indent = env('indent')
+
diff --git a/testvp-data-computer/v1/retrieve.ini b/testvp-data-computer/v1/retrieve.ini
new file mode 100644
index 0000000..56a26c3
--- /dev/null
+++ b/testvp-data-computer/v1/retrieve.ini
@@ -0,0 +1,28 @@
+# Entrypoint output format
+mimeType = application/x-gzip
+extension = tar.gz
+
+# OpenAPI Documentation - JSON format (dot notation)
+post.operationId = post-v1-retrieve
+post.summary = Retrieves a produced result as a corpus file
+post.description = Since processing is asynchronous, the result, once created, must be retrieved through this route
+post.tags.0 = data-computer
+post.responses.default.description = Corpus file in tar.gz format
+post.responses.default.content.application/x-gzip.schema.type = string
+post.responses.default.content.application/x-gzip.schema.format = binary
+post.requestBody.content.application/json.example.0.value = xMkWJX7GU +post.requestBody.content.application/json.schema.$ref = #/components/schemas/JSONStream +post.requestBody.required = true + +[use] +plugin = basics + +[JSONParse] +separator = * + +[exchange] +value = get('value') + +[FILELoad] +location = /tmp/retrieve + diff --git a/testvp-data-computer/v1/stopwords/en.json b/testvp-data-computer/v1/stopwords/en.json new file mode 100644 index 0000000..eeeb3b0 --- /dev/null +++ b/testvp-data-computer/v1/stopwords/en.json @@ -0,0 +1 @@ +["able", "about", "above", "abroad", "abstract", "according", "accordingly", "across", "actually", "adj", "after", "afterwards", "again", "against", "ago", "ahead", "ain", "all", "allow", "allows", "almost", "alone", "along", "alongside", "already", "also", "although", "always", "amid", "amidst", "among", "amongst", "and", "another", "any", "anybody", "anyhow", "anyone", "anything", "anyway", "anyways", "anywhere", "apart", "appear", "appreciate", "appropriate", "are", "aren", "around", "aside", "ask", "asking", "associated", "available", "away", "awfully", "back", "backward", "backwards", "became", "because", "become", "becomes", "becoming", "been", "before", "beforehand", "begin", "behind", "being", "believe", "below", "beside", "besides", "best", "better", "between", "beyond", "both", "brief", "but", "came", "can", "cannot", "cant", "caption", "cause", "causes", "certain", "certainly", "changes", "clearly", "co", "com", "come", "comes", "concerning", "consequently", "consider", "considering", "contain", "containing", "contains", "corresponding", "could", "couldn", "course", "currently", "dare", "daren", "definitely", "described", "despite", "did", "didn", "different", "directly", "does", "doesn", "doing", "done", "don", "down", "downwards", "during", "each", "edu", "eight", "eighty", "either", "else", "elsewhere", "end", "ending", "enough", "entirely", "especially", "etc", "even", "ever", "evermore", "every", "everybody", "everyone", "everything", "everywhere", "exactly", "example", "except", "fairly", "far", "farther", "few", "fewer", "fifth", "first", "five", "followed", "following", "follows", "for", "forever", "former", "formerly", "forth", "forward", "found", "four", "from", "further", "furthermore", "get", "gets", "getting", "given", "gives", "goes", "going", "gone", "got", "gotten", "greetings", "had", "hadn", "half", "happens", "hardly", "has", "hasn", "have", "haven", "having", "hello", "help", "hence", "her", "here", "hereafter", "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his", "hither", "hopefully", "how", "howbeit", "however", "hundred", "ignored", "immediate", "inasmuch", "inc", "inc", "indeed", "indicate", "indicated", "indicates", "inner", "inside", "insofar", "instead", "into", "inward", "isn", "its", "itself", "just", "keep", "keeps", "kept", "know", "known", "knows", "last", "lately", "later", "latter", "latterly", "least", "less", "lest", "let", "like", "liked", "likely", "likewise", "little", "look", "looking", "looks", "low", "lower", "ltd", "made", "mainly", "make", "makes", "many", "may", "maybe", "mayn", "mean", "meantime", "meanwhile", "merely", "might", "mightn", "mine", "minus", "miss", "more", "moreover", "most", "mostly", "mrs", "much", "must", "mustn", "myself", "name", "namely", "near", "nearly", "necessary", "need", "needn", "needs", "neither", "never", "neverf", "neverless", "nevertheless", "new", "next", "nine", "ninety", "nobody", "non", "none", "nonetheless", "noone", "noone", "nor", 
"normally", "not", "nothing", "notwithstanding", "novel", "now", "nowhere", "obviously", "off", "often", "okay", "old", "once", "one", "ones", "only", "onto", "opposite", "other", "others", "otherwise", "ought", "oughtn", "our", "ours", "ourselves", "out", "outside", "over", "overall", "own", "particular", "particularly", "past", "per", "perhaps", "placed", "please", "plus", "possible", "presumably", "probably", "provided", "provides", "que", "quite", "rather", "really", "reasonably", "recent", "recently", "regarding", "regardless", "regards", "relatively", "respectively", "right", "round", "said", "same", "saw", "say", "saying", "says", "second", "secondly", "see", "seeing", "seem", "seemed", "seeming", "seems", "seen", "self", "selves", "sensible", "sent", "serious", "seriously", "seven", "several", "shall", "shan", "she", "should", "shouldn", "since", "six", "some", "somebody", "someday", "somehow", "someone", "something", "sometime", "sometimes", "somewhat", "somewhere", "soon", "sorry", "specified", "specify", "specifying", "still", "sub", "such", "sup", "sure", "take", "taken", "taking", "tell", "tends", "than", "thank", "thanks", "thanx", "that", "thats", "the", "their", "theirs", "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "therefore", "therein", "theres", "thereupon", "these", "they", "thing", "things", "think", "third", "thirty", "this", "thorough", "thoroughly", "those", "though", "three", "through", "throughout", "thru", "thus", "till", "together", "too", "took", "toward", "towards", "tried", "tries", "truly", "try", "trying", "twice", "two", "under", "underneath", "undoing", "unfortunately", "unless", "unlike", "unlikely", "until", "unto", "upon", "upwards", "use", "used", "useful", "uses", "using", "usually", "value", "various", "versus", "very", "via", "viz", "want", "wants", "was", "wasn", "way", "welcome", "well", "went", "were", "weren", "what", "whatever", "when", "whence", "whenever", "where", "whereafter", "whereas", "whereby", "wherein", "whereupon", "wherever", "whether", "which", "whichever", "while", "whilst", "whither", "who", "whoever", "whole", "whom", "whomever", "whose", "why", "will", "willing", "wish", "with", "within", "without", "wonder", "won", "would", "wouldn", "yes", "yet", "you", "your", "yours", "yourself", "yourselves", "zero", "uucp", "www", "amount", "bill", "bottom", "call", "computer", "con", "couldnt", "cry", "describe", "detail", "due", "eleven", "empty", "fifteen", "fifty", "fill", "find", "fire", "forty", "front", "full", "give", "hasnt", "herse", "himse", "interest", "itse\u201d", "mill", "move", "myse\u201d", "part", "put", "show", "side", "sincere", "sixty", "system", "ten", "thick", "thin", "top", "twelve", "twenty", "abst", "accordance", "act", "added", "adopted", "affected", "affecting", "affects", "announce", "anymore", "apparently", "approximately", "arent", "arise", "auth", "beginning", "beginnings", "begins", "biol", "briefly", "date", "effect", "etal", "fix", "gave", "giving", "heres", "hes", "hid", "home", "immediately", "importance", "important", "index", "information", "invention", "itd", "keys", "largely", "lets", "line", "means", "million", "mug", "nay", "necessarily", "nos", "noted", "obtain", "obtained", "omitted", "ord", "owing", "page", "pages", "poorly", "possibly", "potentially", "predominantly", "present", "previously", "primarily", "promptly", "proud", "quickly", "ran", "readily", "ref", "refs", "related", "research", "resulted", "resulting", "results", "run", "sec", "section", "shed", 
"shes", "showed", "shown", "showns", "shows", "significant", "significantly", "similar", "similarly", "slightly", "somethan", "specifically", "state", "states", "stop", "strongly", "substantially", "successfully", "sufficiently", "suggest", "thered", "thereof", "therere", "thereto", "theyd", "theyre", "thou", "thoughh", "thousand", "throug", "til", "tip", "ups", "usefully", "usefulness", "vol", "vols", "wed", "whats", "wheres", "whim", "whod", "whos", "widely", "words", "world", "youd", "youre"] \ No newline at end of file diff --git a/testvp-data-computer/v1/tree-segment.ini b/testvp-data-computer/v1/tree-segment.ini new file mode 100644 index 0000000..509cba7 --- /dev/null +++ b/testvp-data-computer/v1/tree-segment.ini @@ -0,0 +1,97 @@ +# Entrypoint output format +mimeType = application/json + +# OpenAPI Documentation - JSON format (dot notation) +post.operationId = post-v1-tree-segment +post.summary = Création de segments à partir de tableaux +post.description = Le corpus est transformé en liste de segments (source, target, weight) à partir d'un tableau simple ou d'un tableau imbriqué. +post.tags.0 = data-computer +post.requestBody.content.application/x-gzip.schema.type = string +post.requestBody.content.application/x-gzip.schema.format = binary +post.requestBody.content.application/x-tar.schema.type = string +post.requestBody.content.application/x-tar.schema.format = binary +post.requestBody.required = true +post.responses.default.description = Informations permettant de récupérer les données le moment venu +post.parameters.0.description = Indenter le JSON résultant +post.parameters.0.in = query +post.parameters.0.name = indent +post.parameters.0.schema.type = boolean +post.parameters.1.description = URL pour signaler que le traitement est terminé +post.parameters.1.in = header +post.parameters.1.name = X-Webhook-Success +post.parameters.1.schema.type = string +post.parameters.1.schema.format = uri +post.parameters.1.required = false +post.parameters.2.description = URL pour signaler que le traitement a échoué +post.parameters.2.in = header +post.parameters.2.name = X-Webhook-Failure +post.parameters.2.schema.type = string +post.parameters.2.schema.format = uri +post.parameters.2.required = false + +[env] +path = generator +value = tree-segment + +[use] +plugin = basics +plugin = analytics + +# Step 1 (générique): Charger le fichier corpus +[delegate] +file = charger.cfg + +# Step 1.1 (spécifique): Controle du premier element en supposant que les autres lui ressemblent +[singleton] +[singleton/validate] +path = id +rule = required + +path = value +rule = required|array + +# Step 2 (générique): Traiter de manière asynchnore les items reçus +[fork] +standalone = true +logger = logger.cfg + +# Step 2.1 (spécifique): Lancer un calcul sur tous les items reçus +[fork/delegate] + +# Step 2.1.1 (spécifique): S'assurer d'avoir des tableaux de tableaux +[fork/delegate/replace] +path = id +value = get('id') +path = value +value = get('value').thru(x => x && Array.isArray(x[0])?x:[x]) + +# Step 2.1.2 (spécifique): Créer des tableaux de paires des segments (ou Bigramme) +[fork/delegate/segment] +aggregate = false +path = value +identifier = id + +# Step 2.1.3 (spécifique): Regrouper les segments +[fork/delegate/aggregate] + +# Step 2.1.4 (spécifique): Construire un résulat spécifique du calcul +[fork/delegate/replace] +path = source +value = get('id.0') +path = target +value = get('id.1') +path = weight +value = get('value').size() +path = origin +value = get('value').uniq() + +[fork/transit] + +# 
+[fork/delegate]
+file = recorder.cfg
+
+# Step 3: immediately return a single element telling how to retrieve the result once it is ready
+[delegate]
+file = recipient.cfg
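Step 2.1.1 above relies on a lodash `thru` to accept both flat arrays and arrays of arrays; the same normalization, written out in Python for clarity (illustrative only):

```python
def ensure_nested(value):
    """Wrap a flat array so downstream steps always see an array of arrays."""
    return value if value and isinstance(value[0], list) else [value]

assert ensure_nested(["a", "b", "c"]) == [["a", "b", "c"]]
assert ensure_nested([["a", "b"], ["c"]]) == [["a", "b"], ["c"]]
```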