Nicolas Thouvenin authored on 20 Oct 2023

data-computer

The data-computer instance uses the ezmaster application [lodex-workers](https://github.com/Inist-CNRS/lodex-workers).

It provides several asynchronous services for simple data computations and transformations.

All services accept only standard corpus files in tar.gz format as input.

Configuration

The instance's configuration file must declare that it uses the following Node.js packages:

  • @ezs/analytics
  • @ezs/basics

The latest versions are preferable.

Example:

{
    "packages": [
        "@ezs/core@3.0.5",
        "@ezs/analytics@2.1.0",
        "@ezs/basics@2.5.3",
        "@ezs/spawn@1.4.4"
    ]
}

Usage

v1/tree-segment

Creates sliding pairwise segments from the elements of an array (each element paired with the next one) and aggregates these segments to count them.

Because the segments slide along the array, this processing produces segments that represent a hierarchical tree.

For example, with this input data:

[
    { "value": ["a", "b", "c"] },
    { "value": ["a", "c", "d"] },
    { "value": ["a", "b", "d"] },
    { "value": ["a", "b", "c", "d"] },
    { "value": ["a", "c", "d", "e"] }
]

the result will be:

[
    {"source":"a","target":"b","weight":3,"origin":["#1","#3","#4"]},
    {"source":"b","target":"c","weight":2,"origin":["#1","#4"]},
    {"source":"a","target":"c","weight":2,"origin":["#2","#5"]},
    {"source":"c","target":"d","weight":3,"origin":["#2","#4","#5"]},
    {"source":"b","target":"d","weight":1,"origin":["#3"]},
    {"source":"d","target":"e","weight":1,"origin":["#5"]}
]

NOTE: The service accepts arrays of arrays (lodex/istex use case)
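The aggregation described above can be sketched in a few lines of Python. This is only an illustration, not the service's actual implementation: the function name `tree_segments` is hypothetical, and it assumes that the `origin` identifiers are the 1-based positions of the records in the input and that each `value` is a flat array.

```python
def tree_segments(records):
    """Count sliding pairs (value[i], value[i + 1]) across all records."""
    edges = {}  # dicts keep insertion order, so edges appear in first-seen order
    for position, record in enumerate(records, start=1):
        values = record["value"]
        # zip(values, values[1:]) yields each element with its successor
        for source, target in zip(values, values[1:]):
            edge = edges.setdefault(
                (source, target),
                {"source": source, "target": target, "weight": 0, "origin": []},
            )
            edge["weight"] += 1
            # Assumption: origins are "#" + 1-based record position
            edge["origin"].append(f"#{position}")
    return list(edges.values())


records = [
    {"value": ["a", "b", "c"]},
    {"value": ["a", "c", "d"]},
    {"value": ["a", "b", "d"]},
    {"value": ["a", "b", "c", "d"]},
    {"value": ["a", "c", "d", "e"]},
]

for edge in tree_segments(records):
    print(edge)
```

Run on the sample input, this sketch yields the same six segments, weights and origins as the result listed above.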

URL parameter(s)

| name | description |
| --- | --- |
| indent (true/false) | Indent the result returned immediately |

HTTP header(s)

| name | description |
| --- | --- |
| X-Hook | URL to call when the result becomes available (optional) |

Command-line example

# Send data for batch processing
cat input.tar.gz |curl --data-binary @-  -H "X-Hook: https://webhook.site/dce2fefa-9a72-4f76-96e5-059405a04f6c" "http://localhost:31976/v1/tree-segment" > output.json

# When the corpus is processed, get the result
cat output.json |curl --data-binary @- "http://localhost:31976/v1/retrieve" > output.tar.gz
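The same two-step exchange can be scripted with the Python standard library. The sketch below is an assumption-laden illustration: the helper names (`build_submit_request`, `build_retrieve_request`, `send`) are hypothetical, and the base URL is simply the host and port used in the curl commands above. It posts the response of the first call back to `v1/retrieve` unchanged, as the curl example does, without assuming anything about its contents.

```python
import urllib.request

# Host and port copied from the curl example; adjust to your instance.
BASE_URL = "http://localhost:31976"


def build_submit_request(archive_bytes, service, hook_url=None):
    """Build the POST request that sends a tar.gz corpus to a v1 service."""
    headers = {}
    if hook_url:
        # Optional header: URL called when the result is available
        headers["X-Hook"] = hook_url
    return urllib.request.Request(
        f"{BASE_URL}/v1/{service}",
        data=archive_bytes,
        headers=headers,
        method="POST",
    )


def build_retrieve_request(ticket_bytes):
    """Build the POST request that exchanges the first response for the result."""
    return urllib.request.Request(
        f"{BASE_URL}/v1/retrieve", data=ticket_bytes, method="POST"
    )


def send(request):
    """Send a request and return the raw response body."""
    with urllib.request.urlopen(request) as response:
        return response.read()
```

A typical use would read `input.tar.gz` as bytes, call `send(build_submit_request(data, "tree-segment"))`, keep the returned bytes, and pass them to `send(build_retrieve_request(...))` once the corpus has been processed (for instance after the X-Hook URL has been called).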

v1/graph-segment

Creates pairwise segments from all the elements of an array and aggregates these segments to count them. Because the segments cover every possible association between elements, this processing produces segments that represent a network.

For example, with this input data:

[
    { "value": ["a", "b", "c"] },
    { "value": ["a", "c", "d"] },
    { "value": ["a", "b", "d"] },
    { "value": ["a", "b", "c", "d"] },
    { "value": ["a", "c", "d", "e"] }
]

the result will be:

[
    {"source":"a","target":"b","weight":3,"origin":["#1","#3","#4"]},
    {"source":"a","target":"c","weight":4,"origin":["#1","#2","#4","#5"]},
    {"source":"b","target":"c","weight":2,"origin":["#1","#4"]},
    {"source":"a","target":"d","weight":4,"origin":["#2","#3","#4","#5"]},
    {"source":"c","target":"d","weight":3,"origin":["#2","#4","#5"]},
    {"source":"b","target":"d","weight":2,"origin":["#3","#4"]},
    {"source":"a","target":"e","weight":1,"origin":["#5"]},
    {"source":"c","target":"e","weight":1,"origin":["#5"]},
    {"source":"d","target":"e","weight":1,"origin":["#5"]}
]

NOTE: The service accepts arrays or arrays of arrays
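The all-pairs aggregation can be sketched in Python with `itertools.combinations`. This is an illustration under stated assumptions rather than the service's code: the function name `graph_segments` is hypothetical, pairs are kept in the order the elements appear in each array, `origin` identifiers are taken to be 1-based record positions, and each `value` is treated as a flat array without repeated elements.

```python
from itertools import combinations


def graph_segments(records):
    """Count every pair of elements that co-occur in a record."""
    edges = {}  # insertion-ordered: edges appear in first-seen order
    for position, record in enumerate(records, start=1):
        # combinations(values, 2) yields each unordered pair once,
        # preserving the order of the elements in the array
        for source, target in combinations(record["value"], 2):
            edge = edges.setdefault(
                (source, target),
                {"source": source, "target": target, "weight": 0, "origin": []},
            )
            edge["weight"] += 1
            # Assumption: origins are "#" + 1-based record position
            edge["origin"].append(f"#{position}")
    return list(edges.values())


records = [
    {"value": ["a", "b", "c"]},
    {"value": ["a", "c", "d"]},
    {"value": ["a", "b", "d"]},
    {"value": ["a", "b", "c", "d"]},
    {"value": ["a", "c", "d", "e"]},
]

for edge in graph_segments(records):
    print(edge)
```

On the sample input this produces the nine segments shown above. The only difference from the sliding variant is that every pair in an array is counted, not just adjacent elements.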

URL parameter(s)

| name | description |
| --- | --- |
| indent (true/false) | Indent the result returned immediately |

HTTP header(s)

| name | description |
| --- | --- |
| X-Hook | URL to call when the result becomes available (optional) |

Command-line example

# Send data for batch processing
cat input.tar.gz |curl --data-binary @-  -H "X-Hook: https://webhook.site/dce2fefa-9a72-4f76-96e5-059405a04f6c" "http://localhost:31976/v1/graph-segment" > output.json

# When the corpus is processed, get the result
cat output.json |curl --data-binary @- "http://localhost:31976/v1/retrieve" > output.tar.gz

v1/lda

Builds, from the whole set of documents, an "lda" field made up of 5 topics. Each topic contains a "words" field, a list of the 10 words most characteristic of the topic, and a "weight" field, the weight associated with that topic in the document. The text must be in English.

For example, for one document taken from a set of documents (the id "85" is entirely arbitrary):

{
  "id": 85, 
  "value": "During my culinary adventure through the bustling markets of Marrakech, where the scent of exotic spices hung in the air and vendors beckoned with colorful displays of fruits and textiles, I savored tagines, couscous, and mint tea, discovering the rich tapestry of Moroccan flavors."
}

The result will be:

{
  "id":85,
  "value": "During my culinary adventure through the bustling markets of Marrakech, where the scent of exotic spices hung in the air and vendors beckoned with colorful displays of fruits and textiles, I savored tagines, couscous, and mint tea, discovering the rich tapestry of Moroccan flavors.",
  "lda": {
    "topic_1": {
      "words": [
        "sky",
        "tranquil",
        "yellow",
        "solace",
        "symphony",
        "leave",
        "bird",
        "taxi",
        "cityscape",
        "provide"
      ],
      "weight": "0.0133591"
    },
    "topic_2": {
      "words": [
        "bustling",
        "air",
        "savor",
        "tapestry",
        "rich",
        "adventure",
        "tea",
        "discover",
        "flavor",
        "hang"
      ],
      "weight": "0.94660753"
    },
    "topic_3": {
      "words": [
        "street",
        "air",
        "cottage",
        "quaint",
        "melodic",
        "seaside",
        "water",
        "shore",
        "collect",
        "sandy"
      ],
      "weight": "0.013361818"
    },
    "topic_4": {
      "words": [
        "forest",
        "atmosphere",
        "leave",
        "filter",
        "tale",
        "tower",
        "create",
        "floor",
        "enchant",
        "shadow"
      ],
      "weight": "0.013335978"
    },
    "topic_5": {
      "words": [
        "mystery",
        "sky",
        "embark",
        "ponder",
        "gaze",
        "overwhelming",
        "light",
        "mountaintop",
        "night",
        "universe"
      ],
      "weight": "0.013335522"
    }
  }
}

NOTE: the algorithm needs many documents to work (more than a hundred), which is why the example is not exhaustive.
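When consuming this output, a common need is to find the topic that dominates a document. The sketch below is a hypothetical helper (`dominant_topic` is not part of the service). It relies only on what the example shows: weights are serialized as strings, so they are converted to floats before comparison. The sample document is an abridged copy of the example above, with the word lists shortened.

```python
def dominant_topic(document):
    """Return (topic name, weight as float, words) for the heaviest topic."""
    topics = document["lda"]
    # Weights are strings in the service output, hence the float() conversion
    name = max(topics, key=lambda key: float(topics[key]["weight"]))
    return name, float(topics[name]["weight"]), topics[name]["words"]


# Abridged from the example output: only the first three words of each topic
sample = {
    "id": 85,
    "lda": {
        "topic_1": {"words": ["sky", "tranquil", "yellow"], "weight": "0.0133591"},
        "topic_2": {"words": ["bustling", "air", "savor"], "weight": "0.94660753"},
        "topic_3": {"words": ["street", "air", "cottage"], "weight": "0.013361818"},
        "topic_4": {"words": ["forest", "atmosphere", "leave"], "weight": "0.013335978"},
        "topic_5": {"words": ["mystery", "sky", "embark"], "weight": "0.013335522"},
    },
}

name, weight, words = dominant_topic(sample)
print(name, weight, words)
```

For document 85 the heaviest topic is `topic_2`, whose words ("bustling", "air", "savor", ...) match the vocabulary of the input text.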

URL parameter(s)

| name | description |
| --- | --- |
| indent (true/false) | Indent the result returned immediately |

HTTP header(s)

| name | description |
| --- | --- |
| X-Hook | URL to call when the result becomes available (optional) |

Command-line example

# Send data for batch processing
cat input.tar.gz |curl --data-binary @-  -H "X-Hook: https://webhook.site/dce2fefa-9a72-4f76-96e5-059405a04f6c" "http://localhost:31976/v1/lda" > output.json

# When the corpus is processed, get the result
cat output.json |curl --data-binary @- "http://localhost:31976/v1/retrieve" > output.tar.gz