CISUC

LemPORT: a High-Accuracy Cross-Platform Lemmatizer for Portuguese

Authors

Abstract

Although lemmatization is a common subtask in natural language processing, few truly cross-platform lemmatization tools specifically target Portuguese, particularly ones that can be integrated into projects developed in Java. To address this gap, we developed a lemmatizer, initially only for our own use, and have since decided to make it publicly available. The lemmatizer presented in this document achieves an overall accuracy above 98% when compared against a manually revised corpus.

Keywords

Lemmatization, Normalization, Rules, Lexicon

Subject

Lemmatizer

Related Project

iCIS - Intelligent Computing in the Internet of Services

Conference

3rd Symposium on Languages, Applications and Technologies (SLATE’14), June 2014

DOI


Cited by

Year 2019 : 2 citations

 Cançado, M., Amaral, L., Amorin, E., Veloso, A., and Mello, H. (2019). Subjetividade em correções de redações: detecção automática através de léxico de operadores de viés linguístico. Even3 Publicações. preprint.

 Sergeevich, S. D. (2019). Information technology for processing of natural-language texts based on the integrational approach. Master’s thesis, Igor Sikorsky Kyiv Polytechnic Institute.

Year 2018 : 5 citations

 Sousa, L., de Mello, R., Cedrim, D., Garcia, A., Missier, P., Ucha, A., Oliveira, A., and Romanovsky, A. (2018). Vazadengue: An information system for preventing and combating mosquito-borne diseases with social networks. Information Systems, 75:26–42.

 Hachaj, T. and Ogiela, M. R. (2018). What can be learned from bigrams analysis of messages in social network? In Proceedings of 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI).

 de Barcelos Silva, A. (2018). O uso de recursos linguísticos para mensurar a semelhança semântica entre frases curtas através de uma abordagem híbrida. Master’s thesis, Universidade do Vale do Rio dos Sinos (UNISINOS).

 Galhardi, L. B., de Barbosa, C. R. S. C., Neto, J. C., and Brancher, J. D. (2018). Analisador léxico-morfológico de redações de estudantes no estilo do ENEM. Nuevas Ideas en Informática Educativa, 14:509–513.

 Gamallo, P. and Pereira-Fariña, M. (2018). Explorando métodos non-supervisados para calcular a similitude semántica textual. Linguamática, 10(2):63–68.

Year 2017 : 1 citation

 Devezas, J. and Nunes, S. (2017). Information Extraction for Event Ranking. In 6th Symposium on Languages, Applications and Technologies (SLATE 2017), volume 56 of OASIcs, pages 18:1–18:14, Dagstuhl, Germany. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.

Year 2016 : 4 citations

 Wijaya, D. T. and Mitchell, T. (2016). Mapping verbs in different languages to knowledge base relations using web text as interlingua. In Proceedings of 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA.

 Wijaya, D. T. (2016). VerbKB: A Knowledge Base of Verbs for Natural Language Understanding. PhD thesis, Carnegie Mellon University.

 de Almeida, H. M. C. (2016). Suffix identification in Portuguese using transducers. Master’s thesis, Instituto Superior Técnico.

 Hachaj, T. and Ogiela, M. R. (2016). Clusters of trends detection in microblogging: Simple natural language processing vs hashtags–which is more informative? In Proceedings of 10th International Conference on Complex, Intelligent, and Software Intensive Systems (CISIS), pages 119–121. IEEE.