Abstract
This thesis presents original research on keyphrase extraction, the text-processing task of automatically extracting from a document the representative and characteristic phrases that express all the critical aspects of its content. Keyphrases form a conceptual summary of a document, which is very useful in digital information management systems for semantic indexing, faceted search, document clustering, and classification. We focus on the unsupervised form of the task, which follows two basic steps. First, the method selects candidate lexical units using heuristics (e.g., excluding stopwords and keeping only words with a specific part-of-speech). It then ranks the candidates and forms keyphrases either by combining top-ranked words or by selecting phrases that score highly as a whole or whose parts score highly. The application domain of the dissertation is text; however, its contributions could easily be applied to other fields where graphs prevail as a means of representing information. The thesis aims at a better understanding of keyphrase extraction methods. We also propose an alternative representation (different from the widely used graph-of-words) and an alternative use of the target document's statistical information (other than the popular centrality measures). Furthermore, we address several evaluation issues: we assess how different evaluation measures, evaluation approaches, and ground truth standards affect the measured performance of the methods, and we introduce new evaluation measures and approaches. Finally, we present a study on the evolution of Greek words using word embeddings.
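The basic steps of an unsupervised approach described above can be illustrated with a minimal sketch. It is not the method proposed in this thesis: the stopword list, the frequency-based word score, and the sum-of-parts phrase score are all assumptions chosen for brevity, and the function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use a fuller
# list and usually a part-of-speech filter as well).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "for", "and", "or", "to",
             "is", "are", "from", "that", "with", "by", "as", "it", "this"}


def tokenize(text):
    """Lowercase the text and keep words plus punctuation marks as tokens."""
    return re.findall(r"[a-z]+|[.,;:!?()]", text.lower())


def candidate_phrases(tokens):
    """Step 1: runs of consecutive content words become candidate phrases.

    Stopwords and punctuation act as phrase boundaries.
    """
    phrases, run = [], []
    for tok in tokens:
        if tok.isalpha() and tok not in STOPWORDS:
            run.append(tok)
        else:
            if run:
                phrases.append(tuple(run))
            run = []
    if run:
        phrases.append(tuple(run))
    return phrases


def extract_keyphrases(text, top_k=3, max_len=3):
    """Steps 1 and 2: select candidates, score words, rank phrases."""
    tokens = tokenize(text)
    # Word score = raw frequency of each content word (a stand-in for any
    # ranking scheme, e.g. a graph centrality measure).
    scores = Counter(t for t in tokens if t.isalpha() and t not in STOPWORDS)
    phrases = {p for p in candidate_phrases(tokens) if len(p) <= max_len}
    # Phrase score = sum of the scores of its words.
    ranked = sorted(phrases, key=lambda p: (-sum(scores[w] for w in p), p))
    return [" ".join(p) for p in ranked[:top_k]]


if __name__ == "__main__":
    doc = "Graph ranking of words. Graph ranking for phrases. Ranking is useful."
    print(extract_keyphrases(doc, top_k=2))
```

Summing word scores favors longer phrases, which is one reason practical methods often normalize by phrase length or rank whole phrases directly.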
In this thesis, we first give a structured presentation of keyphrase extraction methods through informative categorization schemes, together with lists of popular keyphrase extraction datasets, commercial APIs, and free software related to keyphrase extraction. Then, we present a novel unsupervised method for keyphrase extraction, whose main innovation is the use of local word embeddings (employing the GloVe technique), i.e., embeddings trained on the single document under consideration. As this is the first time a local word vector representation is used in keyphrase extraction, we focus on the keyword extraction task in order to improve the scoring and ranking of individual words. Next, we present a performance evaluation study of commercial APIs and state-of-the-art unsupervised keyphrase extraction methods, with a more in-depth analysis of how the extractors' performance results are affected by different evaluation measures, evaluation approaches, and ground truth standards. Finally, in the context of our interest in unsupervised keyphrase extraction from Greek literature documents of the 19th to 21st centuries using word vector representations, we begin a study of the evolution of Greek words via word embeddings.
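To make concrete how the choice of evaluation approach can change a method's reported performance, the following sketch scores the same output under two matching conventions. Exact phrase matching and word-level partial matching are used here as assumed examples of such conventions; they are not claimed to be the specific measures studied or introduced in the thesis, and the function names and sample phrases are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Standard precision, recall and F1 over two sets of items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def exact_match_scores(predicted, gold):
    """A predicted phrase counts only if the whole phrase is in the gold set."""
    def normalize(phrases):
        return {p.lower().strip() for p in phrases}
    return precision_recall_f1(normalize(predicted), normalize(gold))


def partial_match_scores(predicted, gold):
    """Compare the sets of individual words instead of whole phrases."""
    def words(phrases):
        return {w for p in phrases for w in p.lower().split()}
    return precision_recall_f1(words(predicted), words(gold))


if __name__ == "__main__":
    # Hypothetical extractor output and ground truth for one document.
    predicted = ["keyphrase extraction", "word embeddings", "graph"]
    gold = ["keyphrase extraction", "local word embeddings"]
    print("exact:  ", exact_match_scores(predicted, gold))
    print("partial:", partial_match_scores(predicted, gold))
```

On this toy input the exact-match F1 is half the partial-match F1, because "word embeddings" earns no credit against "local word embeddings" under exact matching but most of its credit under word-level matching.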