<?xml version="1.0" ?>
<tei>
	<teiHeader>
		<fileDesc xml:id="_W00-0734"/>
	</teiHeader>
	<text xml:lang="en">
			<front> In:  Proceedings of CoNLL-2000 and LLL-2000,  pages 154-156, Lisbon, Portugal, 2000. <lb/> Chunking with WPDV Models <lb/> Hans van Halteren <lb/>Dept. of Language and Speech, Univ. of Nijmegen <lb/>P.O. Box 9103, 6500 HD Nijmegen <lb/>The Netherlands <lb/> hvh@let.kun.nl <lb/></front>

			<body> 1 Introduction <lb/>In this paper I describe the application of the <lb/>WPDV algorithm to the CoNLL-2000 shared <lb/>task, the identification of base chunks in English <lb/>text (Tjong Kim Sang and Buchholz, 2000). For <lb/>this task, I use a three-stage architecture: I <lb/>first run five different base chunkers, then com-<lb/>bine them and finally try to correct some recur-<lb/>ring errors. Except for one base chunker, which <lb/>uses the memory-based machine learning sys-<lb/>tem TiMBL, 1 all modules are based on WPDV <lb/>models (van Halteren, 2000a). <lb/>2 Architecture components <lb/>The first stage of the chunking architecture con-<lb/>sists of five different base chunkers: <lb/> 1)  As a baseline, I use a stacked TiMBL <lb/>model. For the first level, following Daelemans <lb/>et al. (1999), I use as features all words and <lb/>tags in a window ranging from five tokens to <lb/>the left to three tokens to the right. For the <lb/>second level (cf. Tjong Kim Sang (2000)), I use <lb/>a smaller window, four left and two right, but <lb/>add the IOB suggestions made by the first level <lb/>for one token left and right (but not the focus). <lb/> 2)  The basic WPDV model uses as features <lb/>the words in a window ranging from one left to <lb/>one right, the tags in a window ranging from <lb/>three left to three right, and the IOB sugges-<lb/>tions for the previous two tokens. 2 <lb/> 3)  In the reverse WPDV model, the direction <lb/>of chunking is reversed, i.e. it chunks from the <lb/>end of each utterance towards the beginning. <lb/>4) In the R&amp;M WPDV model, Ramshaw and <lb/>Marcus&apos;s type of IOB-tags are used, i.e. starts <lb/>of chunks are tagged with a B-tag only if the <lb/> 
			
			<note place="footnote">1Cf. http://ilk.kub.nl/. <lb/></note>
			
			<note place="footnote">2For unseen data, i.e. while being applied, the IOB <lb/>suggestions used are of course those suggested by the <lb/>model itself, not the true ones. <lb/></note> 
			
			preceding chunk is of the same type, and with <lb/>an I-tag otherwise. <lb/> 5)  In the LOB WPDV model, the Penn word-<lb/>class tags (as produced by the Brill tagger) <lb/>are replaced by the output of a WPDV tagger <lb/>trained on 90% of the LOB corpus (van Hal-<lb/>teren, 2000b). <lb/>For all WPDV models, the number of fea-<lb/>tures is too high to be handled comfortably by <lb/>the current WPDV implementation. For this <lb/>reason, I use a maximum feature subset size of <lb/>four and a threshold frequency of two. 3 <lb/> The second stage consists of a combination of <lb/>the outputs of the five base chunkers, using an-<lb/>other WPDV model. Each chunker contributes <lb/>a feature containing the IOB suggestions for the <lb/>previous, current and next token. In addition, <lb/>there is a feature for the word and a feature <lb/>combining the (Penn-style) wordclass tags of <lb/>the previous, current and next token. For the <lb/>combination model, I use no feature restrictions, <lb/>and the default hill-climbing procedure. <lb/>In the final stage, I apply corrective mea-<lb/>sures to systematic errors which are observed <lb/>in the output of leave-one-out experiments on <lb/>the training data. For now, I focus on the most <lb/>frequent phrase type, the NP, and especially on <lb/>one weak point: determination of the start po-<lb/>sition of NPs. I use separate WPDV models for <lb/>each of the following cases: <lb/> 1)  Should a token now marked I-NP start a <lb/> 
			
			<note place="footnote">3Cf. van Halteren (2000a). Also, the difference be-<lb/>tween training and running (correct IOB-tags vs model <lb/>suggestions) leads to a low expected generalization qual-<lb/>ity of hill-climbing. I therefore stop climbing after a <lb/>single effective step, but using an alternative climbing <lb/>procedure, in which not only the single best multiplica-<lb/>tions/division is applied per step, but which during ev-<lb/>ery step applies all multiplications/divisions that yielded <lb/>improvements while the opposite operation did not. <lb/></note>

			<page> 154 <lb/></page>

			Phrase type  Number in test set  TiMBL  WPDV basic  WPDV reverse  WPDV R&amp;M  WPDV LOB  Combination  Corrective measures <lb/>
			ADJP  438  64.99  71.14  76.18  70.52  69.83  74.55  74.52 <lb/>
			ADVP  866  75.03  78.96  79.83  78.16  78.50  80.09  79.86 <lb/>
			CONJP  9  36.36  45.45  18.18  20.69  58.82  42.11  42.11 <lb/>
			INTJ  2  66.67  66.67  66.67  66.67  0.00  66.67  66.67 <lb/>
			LST  5  0.00  0.00  0.00  0.00  0.00  0.00  0.00 <lb/>
			NP  12422  91.85  92.65  92.56  92.00  92.35  93.72  93.84 <lb/>
			PP  4811  95.66  96.53  96.85  96.06  96.65  97.09  97.10 <lb/>
			PRT  106  63.10  73.63  68.60  74.07  73.45  74.31  74.31 <lb/>
			SBAR  535  76.50  82.27  85.54  84.18  84.77  85.41  85.41 <lb/>
			VP  4658  92.11  92.80  92.84  92.37  91.45  93.61  93.65 <lb/>
			[final row illegible] <lb/>
			Table 1: Fβ=1 measurements for all systems (as described in the text). In addition we list the <lb/>number of occurrences of each phrase type in the test set. <lb/>new NP? 4 Features used: the wordclass tag se-<lb/>quence within the NP up to the current token, <lb/>the wordclass sequence within the NP from the <lb/>current token, and the current, previous and <lb/>next word within the NP. <lb/> 2)  Should a token now marked B-NP con-<lb/>tinue a preceding NP? Features used: type and <lb/>structure (in terms of wordclass tags) of the cur-<lb/>rent and the preceding two chunks, and the final <lb/>word of the current and the preceding chunk. <lb/> 3)  Should (part of) a chunk now preceding <lb/>an NP be part of the NP? 
Features used: type <lb/>and structure (in wordclass tags) of the current, <lb/>preceding and next chunk (the latter being the <lb/>NP), and the final word of the current and next <lb/>chunk. <lb/>For all three models, the number of different <lb/>features is large. Normally, this would force the <lb/>use of feature restrictions. The training sets are <lb/>very small, however, so that the need for feature <lb/>restrictions disappears and the full model can <lb/>be used. On the other hand, the limited size <lb/>of the training sets has as a disadvantage that <lb/>hill-climbing becomes practically useless. For <lb/>this reason, I do not use hill-climbing but simply <lb/>take the initial first order weight factors. <lb/>Each token is subjected to the appropriate <lb/>model, or, if not in any of the listed situations, <lb/>left untouched. To remove (some) resulting in-<lb/>consistencies, I let an AWK script then change <lb/>the IOB-tag of all commas and coordinators <lb/>that now end an NP into O. <lb/> 
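The final clean-up step described above can be sketched as follows. This is a minimal Python illustration, not the AWK script itself: the function name strip_trailing_separators, the use of the Penn tags "," and "CC" to recognise commas and coordinators, and the single pass over the sentence are all assumptions made for the example.

```python
def strip_trailing_separators(pos_tags, iob_tags, separators=(",", "CC")):
    """Set the IOB tag of commas and coordinators that end an NP to O.

    pos_tags and iob_tags are parallel lists for one sentence.
    The separator tags are an assumption (Penn-style "," and "CC").
    """
    cleaned = list(iob_tags)
    for i, (pos, iob) in enumerate(zip(pos_tags, iob_tags)):
        # Tag of the following token, or "O" at the end of the sentence.
        next_iob = iob_tags[i + 1] if i + 1 != len(iob_tags) else "O"
        # A token ends an NP if it is inside one and the NP does not continue.
        ends_np = iob in ("B-NP", "I-NP") and next_iob != "I-NP"
        if ends_np and pos in separators:
            cleaned[i] = "O"
    return cleaned


if __name__ == "__main__":
    pos = ["DT", "NN", ",", "VBD"]
    iob = ["B-NP", "I-NP", "I-NP", "B-VP"]
    print(strip_trailing_separators(pos, iob))
```

Only the last token of each NP is inspected here; whether the original script repeats the check after a tag has been changed is not stated in the text.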
			
			<note place="footnote">4This cannot already be the first token of an NP, <lb/>as I-tags following a different type of chunk are always <lb/>immediately transformed to B-tags. <lb/></note>
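The R&amp;M-style tagging used by base chunker 4 can be illustrated with a short conversion routine. The sketch below is in Python and assumes that the standard representation marks every chunk start with a B-tag; the function name iob2_to_iob1 and the example sentence are hypothetical and not part of the described system.

```python
def iob2_to_iob1(tags):
    """Convert tags with a B-tag on every chunk start to the
    Ramshaw and Marcus style, where a chunk start gets a B-tag
    only if the preceding chunk is of the same type."""
    converted = []
    prev_type = None  # type of the chunk directly before, None after an O
    for tag in tags:
        if tag == "O":
            converted.append("O")
            prev_type = None
            continue
        prefix, chunk_type = tag.split("-", 1)
        if prefix == "B" and prev_type != chunk_type:
            # No adjacent chunk of the same type: the boundary is
            # already unambiguous, so an I-tag is used instead.
            converted.append("I-" + chunk_type)
        else:
            converted.append(tag)
        prev_type = chunk_type
    return converted


if __name__ == "__main__":
    example = ["B-NP", "I-NP", "B-NP", "B-VP", "O", "B-NP"]
    print(iob2_to_iob1(example))
```

In the example only the third token keeps its B-tag, because it starts an NP directly after another NP.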
			
			 3 Results <lb/> The Fβ=1 scores for all systems are listed in Ta-<lb/>ble 1. They vary greatly per phrase type, partly <lb/>because of the relative difficulty of the tasks but <lb/>also because of the variation in the number of <lb/>relevant training and test cases: the most fre-<lb/>quent phrase types (NP, PP and VP) also show <lb/>the best results. Note that three of the phrase <lb/>types (CONJP, INTJ and LST) are too infre-<lb/>quent to yield statistically sensible information. <lb/>The TiMBL results are worse than the ones <lb/>reported by Buchholz et al. (1999), 5 but the lat-<lb/>ter were based on training on WSJ sections 00-<lb/>19 and testing on 20-24. When comparing with <lb/>the NP scores of Daelemans et al. (1999), we see <lb/>a comparable accuracy (actually slightly higher <lb/>because of the second level classification). <lb/>The WPDV accuracies are almost all much <lb/>higher. For NP, the basic and reverse model <lb/>produce accuracies which can compete with <lb/>the highest published non-combination accura-<lb/>cies so far. Interestingly, the reverse model <lb/>yields the best overall score. This can be ex-<lb/>plained by the observation that many choices, <lb/>e.g. PP/PRT and especially ADJP/part of NP, <lb/>are based mostly on the right context, about <lb/>which more information becomes available when <lb/>the text is handled from right to left. The <lb/>R&amp;M-type IOB-tags are generally less useful <lb/>than the standard ones, but still show excep-<lb/>tional quality for some phrase types, e.g. PRT. <lb/>The results for the LOB model are disappoint-<lb/>ing, given the overall quality of the tagger used <lb/>

			<note place="footnote">5 FADJP=66.7, FADVP=77.9, FNP=92.3, FPP=96.8, <lb/> FVP=91.8 <lb/></note>

			<page> 155 <lb/></page>

			(97.82% on the held-out 10% of LOB). I hypoth-<lb/>esize this to be due to: a) differences in text <lb/>type between LOB and WSJ, b) partial incom-<lb/>patibility between the LOB tags and the WSJ <lb/>chunks and c) insufficiency of chunker training <lb/>set size for the more varied LOB tags. <lb/>Combination, as in other tasks (e.g. van Hal-<lb/>teren et al. (To appear)), leads to an impressive <lb/>accuracy increase, especially for the three most <lb/>frequent phrase types, where there is a suffi-<lb/>cient number of cases to train the combination <lb/>model on. There are only two phrase types, <lb/>ADVP and SBAR, where a base chunker (re-<lb/>verse WPDV) manages to outperform the com-<lb/>bination. In both cases the four normal direc-<lb/>tion base chunkers outvote the better-informed <lb/>reverse chunker, probably because the combina-<lb/>tion system has insufficient training material to <lb/>recognize the higher information value of the re-<lb/>verse model (for these two phrase types). Even <lb/>though the results are already quite good, I ex-<lb/>pect that even more effective combination is <lb/>possible, with an increase in training set size <lb/>and the inclusion of more base chunkers, espe-<lb/>cially ones which differ substantially from the <lb/>current, still rather homogeneous, set. <lb/>The corrective measures yield further im-<lb/>provement, although less impressive. Unsur-<lb/>prisingly, the increase is found mostly for the <lb/>NP. The next most affected phrase type is the <lb/>ADJP, which can often be joined with or re-<lb/>moved from the NP. There is an increase in re-<lb/>call for ADJP (71.23% to 71.46%), but a de-<lb/>crease in precision (78.20% to 77.86%), leav-<lb/>ing the Fβ=1 value practically unchanged. For <lb/>ADVP, there is a loss of accuracy, most likely <lb/>caused by the one-shot correction procedure. 
<lb/>This loss will probably disappear when a proce-<lb/>dure is used which is iterative and also targets <lb/>other phrase types than the NP. For VP, on the <lb/>other hand, there is an accuracy increase, prob-<lb/>ably due to a corrected inclusion/exclusion of <lb/>participles into/from NPs. The overall scores <lb/>show an increase, especially due to the per-type <lb/>increases for the very frequent NP and VP. <lb/>All scores for the chunking system as a whole, <lb/>including precision and recall percentages, are <lb/>listed in Table 2. For all phrase types, the <lb/>system yields substantially better results than <lb/>any previously published. I attribute the im-<lb/>provements primarily to the combination archi-<lb/>tecture, with a smaller but yet valuable contri-<lb/>bution by the corrective measures. The choice <lb/>for WPDV proves a good one, as the WPDV <lb/>algorithm is able to cope well with all the mod-<lb/>eling tasks in the system. Whether it is the best <lb/>choice can only be determined by future experi-<lb/>ments, using other machine learning techniques <lb/>in the same architecture. <lb/>
			test data  precision  recall  Fβ=1 <lb/>
			ADJP  77.86%  71.46%  74.52 <lb/>
			ADVP  80.52%  79.21%  79.86 <lb/>
			CONJP  40.00%  44.44%  42.11 <lb/>
			INTJ  100.00%  50.00%  66.67 <lb/>
			LST  0.00%  0.00%  0.00 <lb/>
			NP  93.55%  94.13%  93.84 <lb/>
			PP  96.43%  97.78%  97.10 <lb/>
			PRT  72.32%  76.42%  74.31 <lb/>
			SBAR  87.77%  83.18%  85.41 <lb/>
			VP  93.36%  93.95%  93.65 <lb/>
			all  93.13%  93.51%  93.32 <lb/>
			Table 2: Final results per chunk type, i.e. af-<lb/>ter applying corrective measures to base chun-<lb/>ker combination. <lb/></body>
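The Fβ=1 values in Table 2 follow from the listed precision and recall percentages. The Python sketch below assumes the standard F-measure definition, F = (1 + β²)·P·R / (β²·P + R), which the paper does not spell out; the function name f_beta is hypothetical.

```python
def f_beta(precision, recall, beta=1.0):
    """Return the F-measure for precision and recall given in percent."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # Both values are zero, as in the LST row of Table 2.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


if __name__ == "__main__":
    # (label, precision, recall) taken from Table 2.
    rows = [("ADJP", 77.86, 71.46), ("NP", 93.55, 94.13), ("all", 93.13, 93.51)]
    for label, precision, recall in rows:
        print(label, round(f_beta(precision, recall), 2))
```

For the three rows used here the rounded results agree with the Fβ=1 column of Table 2 (74.52, 93.84 and 93.32).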

			<listBibl>References <lb/>Sabine Buchholz, Jorn Veenstra, and Walter Daele-<lb/>mans. 1999. Cascaded grammatical relation as-<lb/>signment. In  Proceedings of EMNLP/VLC-99. <lb/> Association for Computational Linguistics. <lb/>W. Daelemans, S. Buchholz and J. Veenstra. 1999. <lb/>Memory-based shallow parsing. In  Proceedings of <lb/>CoNLL, Bergen, Norway. <lb/> H. van Halteren. 2000a. A default first order family <lb/>weight determination procedure for WPDV mod-<lb/>els. In  Proceedings of the CoNLL-2000.  Associa-<lb/>tion for Computational Linguistics. <lb/>H. van Halteren. 2000b. The detection of inconsis-<lb/>tency in manually tagged text. In  Proceedings of <lb/>LINC2000. <lb/> H. van Halteren, J. Zavrel, and W. Daelemans. To <lb/>appear. Improving accuracy in wordclass tagging <lb/>through combination of machine learning systems. <lb/> Computational Linguistics. <lb/> E. F. Tjong Kim Sang. 2000. Noun phrase recogni-<lb/>tion by system combination. In  Proceedings of the <lb/>ANLP-NAACL 2000.  Seattle, Washington, USA. <lb/>Morgan Kaufmann Publishers. <lb/>E. F. Tjong Kim Sang and S. Buchholz. 2000. Intro-<lb/>duction to the CoNLL-2000 shared task: Chunk-<lb/>ing. In  Proceedings of the CoNLL-2000.  Associa-<lb/>tion for Computational Linguistics. <lb/></listBibl>

			<page> 156 </page>


	</text>
</tei>