[terms-extraction/teeft] unusable terms on chemistry corpora #31

Closed parmentf opened this issue on 7 Apr 2023 - 3 comments

@parmentf parmentf commented on 7 Apr 2023

Some special cases are problematic, particularly in abstracts of chemistry articles.
See the Trello card.

Examples with a dash

[
  {
    "id": "ark:/67375/1BB-DQJ4CLQM-M",
    "value": "Newer reactions of coordinated peroxide at metal and non-metal centres are described. Reactions of peroxo-metal complexes with SO2 (g), NO2(g), and CO2(g) have been carried out in aqueous medium. Typically, reactions of a highly peroxygenated metal complex, A[V(O2)3]·3H2O (A=Na,K), follow an unprecedented sequence. The deep blue ESR-silent solution of A[V(O2)3]·3H2O reacts to produce a yellow, ESR-inactive solution that on further reaction with the chosen substrate affords a green-blue ESR-active (cf. VO2+) solution. The reaction proceeds through distinct steps such that, first, one of the coordinated peroxides undergoes a two-electron irreversible cleavage of the O-O bond leading to a diperoxy-mono(sulphato)vanadate(V) intermediate, [(O2)2V-O-SO3]−, that readily undergoes hydrolysis to produce H2SO4 and an aquaoxo-diperoxovanadate(V) complex, [VO(O2)2H2O]−. The latter complex reacts with more SO2(g) causing reduction of vanadium (V) to vanadium (IV) and conversion of coordinated peroxide to coordinated sulphate producing the bis(sulphato)vanadyl complex, [VO(SO2)2(H2O)2]2−. Further, the reaction of A[V(O2)3]·3H2O with SO2(g) in the presence of AF, yielding a ternary fluoro(sulphato)oxovanadate (IV) complex A4[VO(SO4)2F2(H2O)]·2H2O, serves as a paradigm for the synthesis of ternary complexes of vanadyl, VO2+. It is also evidentinteralia that the [V(O2)3]− species offers potential as a novel synthon. Some recent developments in the peroxo-chemistry of B, C, P and As are highlighted. Heretofore unreported salts of peroxo phosphoric acid, viz. (NH4)3[PO3(O2)]·3H2O and Na3[PO3(O2)]·3H2O, have been synthesized and their potential as oxidants explored. Their role in oxidising organic substances is highlighted, especially as a substitute for the alkaline-H2O2 reagent."
  }, {
    "id": "ark:/67375/1BB-H5X2LK6B-7",
    "value": "Two independently biologically active compounds, periplanone A and periplanone B can be isolated from fecal material of the American cockroach,Periplaneta americana. In fecal material these occur in a ratio of 1 ∶ 10 while, in intestinal tracts only periplanone B has been found. The latter has been identified as (1Z,5E)-1,10(14)-diepoxy-4(15),5-germacradien-9-one; the identification was confirmed by synthesis. Only the CD (−) enantiomer (1R,2R,5E,7S,10R) exhibited activity. The lower threshold of activity of both natural and synthetic CD (−) pheromone, is 10−6–10−7 μg. Periplanone A has been identified (apart from its stereochemical configuration) as 7-methylene-4-isopropyl-12-oxa-tricyclo[4.4.2.01,5]-9-dodecen-2-one). The structure of this rather unstable compound could be deduced by comparing its NMR, UV, IR, and mass spectra with the NMR and mass spectra of its rearrangement product. Both structures still require confirmation by synthesis, but their spectral data are in complete agreement with the proposed structures. The presence of only periplanone B in the gut and the presence of both periplanone A and periplanone B in the feces suggests that periplanone B is a genuine sex pheromone, whereas peri-planone A might be a biologically active transformation product, which in turn can isomerize into a more stable, but inactive compound."
  }, {
    "id": "ark:/67375/1BB-KQ0DLZZ5-N",
    "value": "The mixed valence Mn(III, IV) complexes, [Mn2O2L4]X3 with L=2,2-bipyridine or 1,10-phenanthroline and X=ClO 4 − or PF 6 − undergo partial ligand displacement reactions giving rise to the new compounds [Mn2O2L3A2]X3 with A=N, N-dimethylformamide or pyridine. The substitution is believed to take place at the labiled 4, Mn(III) centre. The substituted complexes have more deeply trapped valencies based on their electronic spectral characteristics. The EPR spectra are found to be essentially unaffected by ligand substitutions. Computer simulations of frozen solutions as well as polycrystalline spectra of the PF 6 − salts showing hyperfine splittings are presented. All the complexes evolve oxygen from water when present as a solid phase in contact with an aqueous solution containing Ce4+ions. The oxygen-evolving solution is found to contain MnO 4 − ions."
  }, {
    "id": "ark:/67375/6H6-02MSJV95-H",
    "value": "The opium alkaloid (−)-codeine was synthesized in eight steps from (±)-N-norreticuline R-(−)-norreticuline, obtained by resolution, was converted to (R)-N-trifluoroacetyl-6'-bromonorreticuline and the latter was subjected to phenolic oxidative coupling with a variety of aryliodoso complexes in dichloromethane. N-Trifluoroacetyl-1-bromonorsalutaridine prepared by this means was transformed to 1-bromosalutaridinol (as a mixture of epimers), and the latter were dehydrated separately to 1-bromothebaine with dimethylformamide dineopentyl acetal. Hydrolysis to 1-bromocodeinone, followed by reductive removal of Br with LAH, afforded (−)-codeine.\n\n−, -codeine, − -codeine, Biomimetic total synthesis, Opium alkaloid − -codeine, Eight steps, -n-norreticuline r- − -norreticuline, R -n-trifluoroacetyl-, Phenolic oxidative, Aryliodoso complexes"
  }
]

Querying terms-extraction/v1/teeft/en returns:

[
  {
    "id": "ark:/67375/1BB-DQJ4CLQM-M",
    "value": [
      "vo",
      "−",
      "newer reactions",
      "non-metal centres",
      "peroxo-metal complexes"
    ]
  },
  {
    "id": "ark:/67375/1BB-H5X2LK6B-7",
    "value": [
      "periplanone",
      "periplanone b",
      "−",
      "fecal material",
      "mass spectra"
    ]
  },
  {
    "id": "ark:/67375/1BB-KQ0DLZZ5-N",
    "value": [
      "mn",
      "−",
      "valence mn iii iv complexes mn",
      "x clo",
      "partial ligand displacement reactions"
    ]
  },
  {
    "id": "ark:/67375/6H6-02MSJV95-H",
    "value": [
      "−",
      "-codeine",
      "opium alkaloid − -codeine",
      "eight steps",
      "-n-norreticuline r- − -norreticuline"
    ]
  }
]

We can see that single-character terms such as "−" (apparently a long dash) are useless.

Examples with a chevron

These are most likely bullet points in the abstract. They carry no particular meaning, but they pollute the text and generate terms.

[
  {
    "id": "ark:/67375/6H6-03DZBN19-0",
    "value": "Highlights► We isolated and characterized a cell population from human nasal septal cartilage. ► Cells dwell on cartilage surface, which correspond to perichondrium cambium layer. ► Cells showed a more restricted lineage potential compared to mesenchymal stem cells. ► Cells pellets extracellular matrix is rich in collagen type II and glycosaminoglycans. ► No growth factors are needed to promote cell chondrogenic differentiation."
  }, {
    "id": "ark:/67375/6H6-03NTJXH0-V",
    "value": "Highlights► We adopt a process-driven view for ontology engineering. ► We construct domain ontologies on a cybernetic infrastructure. ► Ontological representation expresses and defines a target product as a metamodel. ► Thus knowledge can be used for representation and reasoning at functional level. ► This rationale is exemplified on biosensors."
  }, {
    "id": "ark:/67375/6H6-0GVR3MW1-M",
    "value": "Research Highlights► Peter Drucker's 1985 dictum that for several decades social change had been slower than technological change, has been turned on its head. Social change is now the hare and tech change the tortoise, relatively speaking. This will change again in the future. ► The law should continue to protect a particular patent only if the applicant or a licensee commercializes it within a short period of time following the granting of the patent. ► Research is needed on the mechanism by which innovations, which are nondifferentiable points on experience curves, aggregate to form seemingly smooth business cycles and Kondratieff waves. ► Technological Forecasting & Social Change will encourage research on measuring the rate of social change."
  }, {
    "id": "ark:/67375/6H6-0HBXQV4L-5",
    "value": "Graphical abstractHighlights► Recent advances in 2D patterned nanostructures based on monolayer colloidal crystals are reviewed. ► Particular attention is paid to the properties and applications of the MCC-based nanostructures. ► MCC-assisted fabrication at different two-phase interfaces is described. ► MCC-assisted assembly from preformed nanoscale building blocks is discussed."
  }
]

Querying terms-extraction/v1/teeft/en returns:

[
  {
    "id": "ark:/67375/6H6-03DZBN19-0",
    "value": [
      "►",
      "► cells",
      "cell population",
      "human nasal septal cartilage",
      "cartilage surface"
    ]
  },
  {
    "id": "ark:/67375/6H6-03NTJXH0-V",
    "value": [
      "►",
      "process-driven view",
      "domain ontologies",
      "cybernetic infrastructure",
      "► ontological representation"
    ]
  },
  {
    "id": "ark:/67375/6H6-0GVR3MW1-M",
    "value": [
      "social change",
      "►",
      "research highlights►",
      "several decades social change",
      "technological change"
    ]
  },
  {
    "id": "ark:/67375/6H6-0HBXQV4L-5",
    "value": [
      "►",
      "nanostructures",
      "graphical abstracthighlights► recent advances",
      "monolayer colloidal crystals",
      "► particular attention"
    ]
  }
]

Suggestions:

  1. filter out terms that are too short (1 or 2 characters?)
  2. remove certain characters (but how to decide? dashes can be useful, chevrons are not, but Greek letters are, ...)

According to @cuxac, the chevrons that used to appear in ISTEX were a problem that has since been fixed.
We will therefore treat this as a cleanup step to be done upstream where needed.
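
As an illustration of what such an upstream cleanup could look like, here is a minimal Python sketch that strips the bullet character from an abstract before it is sent to the extractor. The function name `strip_bullets` and the `BULLET_CHARS` constant are hypothetical; only "►" is taken from the examples above, and this is not how ISTEX actually fixed the problem.

```python
import re

# Hypothetical set of bullet-like characters; only "►" appears in this issue.
BULLET_CHARS = "►"


def strip_bullets(text: str, bullets: str = BULLET_CHARS) -> str:
    """Replace bullet characters with a space, then collapse whitespace runs."""
    without_bullets = re.sub(f"[{re.escape(bullets)}]", " ", text)
    return re.sub(r"\s+", " ", without_bullets).strip()


abstract = (
    "Highlights► We adopt a process-driven view for ontology engineering. "
    "► This rationale is exemplified on biosensors."
)
print(strip_bullets(abstract))
# Highlights We adopt a process-driven view for ontology engineering. This rationale is exemplified on biosensors.
```

Replacing the bullet with a space rather than deleting it keeps "Highlights►We" style input from being fused into a single token.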

We will remove terms that are only one character long.
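
A minimal Python sketch of that length filter, applied to one of the term lists returned above. The function name and the `min_size` parameter are illustrative assumptions, not the actual implementation.

```python
def remove_short_terms(terms, min_size=2):
    """Keep only terms with at least `min_size` characters (hypothetical parameter)."""
    return [term for term in terms if len(term.strip()) >= min_size]


terms = ["vo", "−", "newer reactions", "non-metal centres", "peroxo-metal complexes"]

print(remove_short_terms(terms))
# ['vo', 'newer reactions', 'non-metal centres', 'peroxo-metal complexes']

print(remove_short_terms(terms, min_size=3))
# ['newer reactions', 'non-metal centres', 'peroxo-metal complexes']
```

With `min_size=2` only the lone "−" is dropped; raising it to 3 also removes two-letter terms such as "vo", which is the "1 or 2 characters?" question raised in the suggestions.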

Pascal also suggested computing a ratio of odd characters to alphanumeric characters to decide whether to drop a term.
We can simply compute the percentage of alphanumeric characters (plus spaces, and perhaps punctuation?) in the term.
We will then need to choose the threshold (Pascal said he would look at what he had done for ISTEX).
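
The ratio could be sketched as follows in Python. The function names, the `allowed_characters` parameter and above all the `0.9` threshold are placeholders chosen for illustration; the real threshold is still to be decided.

```python
def alnum_ratio(term: str, allowed_characters: str = " ") -> float:
    """Share of characters that are alphanumeric or explicitly allowed."""
    if not term:
        return 0.0
    kept = sum(1 for ch in term if ch.isalnum() or ch in allowed_characters)
    return kept / len(term)


def remove_non_alphanumeric_terms(terms, threshold=0.9, allowed_characters=" "):
    """Drop terms whose ratio falls below `threshold` (placeholder value)."""
    return [t for t in terms if alnum_ratio(t, allowed_characters) >= threshold]


terms = ["►", "► cells", "cell population", "-codeine", "eight steps"]
print(remove_non_alphanumeric_terms(terms))
# ['cell population', 'eight steps']
```

`str.isalnum()` is Unicode-aware, so Greek letters count as alphanumeric while "►" and "−" do not. Hyphens lower the score as well, so legitimate terms like "non-metal centres" only survive a strict threshold if "-" is added to `allowed_characters`.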

Although these filters could be added to the current script, it would probably be more maintainable to add ezs instructions to the @ezs/teeft package, such as RemoveNonAlphaNumericTerms and RemoveShortTerms (which could take allowedCharacters and minSize parameters).

The examples do not show any, but we have also had reports of multi-word terms that are far too long to be useful.
Perhaps we should add the counterpart of RemoveShortTerms: RemoveLongTerms?
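
One possible reading of RemoveLongTerms, sketched in Python: cap the number of words in a term. Measuring length in words rather than characters, and the `max_words` parameter and its default, are assumptions made for this sketch. The six-word term below comes from the output above and only serves as a stand-in for the overly long terms that were reported.

```python
def remove_long_terms(terms, max_words=4):
    """Keep only terms with at most `max_words` whitespace-separated words."""
    return [term for term in terms if len(term.split()) <= max_words]


terms = [
    "mn",
    "valence mn iii iv complexes mn",
    "x clo",
    "partial ligand displacement reactions",
]
print(remove_long_terms(terms))
# ['mn', 'x clo', 'partial ligand displacement reactions']
```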

@François Parmentier François Parmentier added a commit that referenced this issue on 17 Apr 2023
e6b50ce feat(terms-extraction): Improve teeft ...
@parmentf parmentf closed this issue on 17 Apr 2023