grobid-corpus/segmentation/public/tei/ecdl_dilia.training.segmentation.tei.xml at 2d5326389fa8c53ee0c077511af7eaaada7404bb

Fork: 0
istex / grobid-corpus
Find file
Newer
Older
grobid-corpus / segmentation / public / tei / ecdl_dilia.training.segmentation.tei.xml
zeynalig on 26 Apr 2017 10 KB initialisation des corpus
Raw Blame History
<?xml version="1.0" ?>
<tei>
	<teiHeader>
		<fileDesc xml:id="-1"/>
	</teiHeader>
	<text xml:lang="en">
			<front>DiLiA – The Digital Library Assistant <lb/> Kathrin Eichler, Holmer Hemsen, Günter Neumann, Norbert Reithinger, Sven <lb/>Schmeier, Kinga Schumacher, and Inessa Seifert <lb/> DFKI GmbH, Alt-Moabit 91c, 10559 Berlin, Germany <lb/> firstname.lastname@dfki.de, <lb/> DFKI home page: http://www.dfki.de <lb/> DiLiA home page: http://dilia.b.dfki.de/ <lb/> Abstract. In this paper we present the digital library assistant (DiLiA). <lb/>The system aims at augmenting the search in digital libraries in sev-<lb/>eral dimensions. In the project advanced information visualisation meth-<lb/>ods are developed for user controlled interactive search. The interaction <lb/>model has been designed in a way that it is transparent to the user and <lb/>easy to use. In addition, information extraction (IE) methods have been <lb/>developed in DiLiA to make the content more easily accessible, this in-<lb/>cludes the identification and extraction of technical terms (TTs) – single <lb/>and multi word terms – as well as the extraction of binary relations based <lb/>on the extracted terms. In DiLiA we follow a hybrid information extrac-<lb/>tion approach – a combination of metadata and document processing. <lb/></front>

			<body>1 Introduction <lb/> Although the content of digital libraries is growing rapidly, popular portals for <lb/>digital libraries, such as Google Scholar, Citeulike, ACM digital library still <lb/>limit the search options to a small set of meta labels (such as author, title, etc.) <lb/>and only provide a limited text-based search interface. So far, these portals do <lb/>not use any elaborated visualisation techniques for presenting the search results. <lb/>This is problematic in two ways. Firstly, since the search options are restricted to <lb/>metadata, a search query that is not specific enough will easily lead to a long list <lb/>of search results. Secondly, since no elaborated visualisation techniques are used, <lb/>navigating through the search result is difficult and time consuming. The goal of <lb/>DiLiA is to go beyond this level of information access. We especially target users <lb/>that want to interactively explore the content of the digital library, for example, <lb/>users that want to investigate a new research area. The DiLiA demonstrator is <lb/>based on real data in the computer science domain. The database contains 1.2 <lb/>million abstracts with corresponding metadata from DBLP. <lb/> 2 Visualisation <lb/> The development of the user interface has been led by the design principle that <lb/>the visual representations should provide clues: what can be done next and what <lb/>are the possible directions for further search [1]. The user interface consists of a <lb/> relational view, visualising relations between the search queries the user specifies; <lb/>a hyperlink activated list of search results with detailed information on each item; <lb/>and tools (e.g., bar charts) for a flexible analysis of the search results. <lb/> Fig. 1. User interface of the DiLiA demonstrator <lb/> Figure 1 shows the user interface (UI) of the DiLiA demonstrator. On the <lb/>right side of the UI the search panel with the search result list is located. Select-<lb/>ing an item in the result set shows its metadata. Users have different possibilities <lb/>for stating a search query. The search panel allows the user to enter a search <lb/>term for searching in the digital libraries metadata. In addition, the user can <lb/>add items from the hyperlink activated search result list, e.g., a keyword or <lb/>an author name. On the left side of the UI relations between search queries are <lb/>displayed (TopicView) in form of a graph (TopicGraph). Each search topic is rep-<lb/>resented as a TopicBlob (node). Edges between the TopicBlobs show the number <lb/>documents common in both TopicBlobs. The TopicBlobs can be combined inter-<lb/>actively via drag-and-drop on boolean operators (AND, OR, NOT) included in <lb/>the TopicView (for details see [2]). For each TopicBlob in focus, subtopics can be <lb/>viewed and selected. The subtopics are automatically generated by dynamically <lb/>clustering the document abstracts using the Carrot 2 Clustering engine  1  . On the <lb/>right side of the UI the user can also switch to various views, supporting visual <lb/>analytics on the data. The bar chart view shows how many documents have been <lb/>published in a specific year for the selected topic. Depending on the curve that <lb/>the bar chart forms the user might be able to see if a research topic is a hot <lb/>topic or if few papers have been published on the topic lately. The user has also <lb/>

			<note place="footnote"> 1  http://project.carrot2.org/ <lb/></note>

			the possibility to use a heatmap view. The heatmap shows using a world map <lb/>the origin of the publications about the topic and how it emerged over time and <lb/>enables the user to see where in the world a research topic started and how it <lb/>spread. On the left side of the UI the user can switch to an author graph, show-<lb/>ing for a selected publication the author, the co-authors and the publications of <lb/>author and co-author and to navigate further. <lb/> 3 Information Extraction <lb/> The goal of IE in DiLiA is on one hand to support digital libraries in the pro-<lb/>cess of making available new material and on the other hand to support users <lb/>in interactively exploring the content. We have developed a Generalised Name <lb/>Recogniser (GNR) for identifying domain independent, fully automatically and <lb/>unsupervised, multi-word technical terms, cf. [3]. Processing only the abstracts <lb/>of the documents, the current prototype contains these technical terms as auto-<lb/>matically generated list of keywords. Based on the identification of TTs in the <lb/>whole document, we are currently working on unsupervised relation extraction <lb/>methods. The extracted relations can be used for advanced search and also serve <lb/>as basis for clustering similar relations/documents. For identifying the TTs  2  we <lb/>used the nominal group (NG) chunker of the GNR, but the output was modified. <lb/>For example, coordinated phrases had to be split or text in parenthesis had to be <lb/>processed separately [3]. Since not every NG is a TT, we needed to find a way to <lb/>filter the NGs. Inspired by Luhn&apos;s findings [4], who suggested that mid-frequency <lb/>terms are the ones that best indicate the topic of a document, frequency scores <lb/>for all NGs using the Live Search API from Microsoft are retrieved. The NGs <lb/>are then filtered using an upper and lower threshold. We found out that the <lb/>upper threshold is domain dependent. For computer science documents the best <lb/>F-measure was achieved with a threshold of 20 mio., for biology 6.5 mio. The <lb/>extracted TTs serve as the basis for relation extraction. <lb/> Fig. 2. Information extraction data flow in DiLiA <lb/> 
			
			<note place="footnote">2  An evaluation on a hand-annotated computer science corpus (DBLP) showed that <lb/>68.2% of the NGs were identified completely and 31.3% partially (caused by missing <lb/>prepositional postmodifiers, additional premodifiers and appositive constructions). <lb/></note> 
			
			Fig. 2 shows the information flow. For the IE process all documents are first <lb/>split into sentences. The identified TTs are then replaced in each sentence with <lb/>a termID. Three different binary relation strategies have been implemented and <lb/>are currently being evaluated. The first strategy &quot; surface patterns &quot; is inspired by <lb/>[5] and uses the following pattern &lt;TermID1&gt;string&lt;TermID2&gt; to match each <lb/>sentence against. For &quot; Verb relations &quot; and &quot; Skeletons &quot; the modified sentences <lb/>are parsed with the Stanford Parser with dependency tree output. In the &quot; Verb <lb/>relation &quot; IE method the verb node and direct neighbour nodes containing TTs <lb/>are extracted. In the &quot; Skeleton &quot; approach [6] the relation consists of information <lb/>collected by going up the dependency tree starting from pairs of TTs and ending <lb/>at a common root node. <lb/> 4 Conclusion and Future Work <lb/> In this paper we presented the DiLiA demonstrator  3  , which provides a novel user <lb/>interface for interactively navigating in a digital library database. The system <lb/>also integrates IE methods (automatic extraction of technical terms and binary <lb/>relations). Currently, we are working on the implementation of the DiLiA system <lb/>for a Touchmaster touch table (2 x 1.10m) and investigate clustering algorithms <lb/>for very large data sets. <lb/>
		
		</body>

		<back>

			<div type="acknowledgement"> Acknowledgment <lb/> The research project DiLiA is co-funded by the European Regional Development <lb/>Fund (ERDF) in context of Investitionsbank Berlin&apos;s ProFIT program under <lb/>grant number 10140159. We gratefully acknowledge this support. <lb/></div>

			<listBibl> References <lb/> 1. Marchionini, G.: Information-seeking strategies of novices using a full-text electronic <lb/>encyclopedia. J. Am. Soc. Inf. Sci. 40(1) (1989) 54 – 66 <lb/>2. Seifert, I., Kruppa, M.: A pool of topics: Interactive relational topic visualization <lb/>for information discovery. In Huang, M.L., Nguyen, Q.V., Zhang, K., eds.: Visual <lb/>Information Communication, Springer (2010) <lb/>3. Eichler, K., Hemsen, H., Neumann, G.: Unsupervised and domain-independent <lb/>extraction of technical terms from scientific articles in digital libraries. In Mandl, <lb/>T., Frommholz, I., eds.: Proc. of the Workshop &quot;Information Retrieval&quot;, organized <lb/>as part of LWA, Darmstadt, Germany (21-23 September 2009) <lb/>4. Luhn, H.P.: The automatic creation of literature abstracts. IBM Journal of Research <lb/>and Development 2 (1958) 157–165 <lb/>5. Yan, Y., Okazaki, N., Matsuo, Y., Yang, Z., Ishizuka, M.: Unsupervised Relation <lb/>Extraction by Mining Wikipedia Texts Using Information from theWeb. In: Proc. <lb/>of ACL and the 4th IJCNLP of the AFNLP, Suntec, Singapore (2009) 1021–1029 <lb/>6. Wang, R., Neumann, G.: Recognizing textual entailment using a subsequence kernel <lb/>method. In: Proc. of AAAI-2007, Vancoucer, Canada (2007) <lb/></listBibl>

			<note place="footnote"> 3  Online accessible via: http://dilia.b.dfki.de/ </note>

		</back>
	</text>
</tei>