A Hybrid Multi-Word Terms Extraction System Applied to Topic Detection
Mutli-word Terms extraction plays an important role in many Natural Language Processing (NLP) tasks. Despite their major importance, few works were dedicated to Arabic multi-word terms extraction. This paper proposes an automatic Arabic multi-word terms (MWTs) extraction system based on two major filtering steps: linguistics filter using a part-of-speech tagger along with morphological patterns and statistical filter based on probabilistic methods, namely: Log-Likelihood Ratio (LLR) and C-value. We evaluate the performances of the realized systems on Wattan; an Arabic oriented topic newspaper corpus. Our system manages to achieve 90.23% in term of multi-word extraction precision. We also study the use of MWTs as features in Arabic Topic Detection. The conducted experiments show good results.
Authors retain the copyright of their manuscripts, and all Open Access articles are distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided that the original work is properly cited.