Search and classify topics in a corpus of text using the latent dirichlet allocation model

Iparraguirre-Villanueva, Orlando; Sierra-Liñan, Fernando; Herrera Salazar, Jose Luis; Beltozar-Clemente, Saul; Pucuhuayla-Revatta, Félix; Zapata-Paulini, Joselyn; Cabanillas-Carbonell, Michael

dc.contributor.author	Iparraguirre-Villanueva, Orlando
dc.contributor.author	Sierra-Liñan, Fernando
dc.contributor.author	Herrera Salazar, Jose Luis
dc.contributor.author	Beltozar-Clemente, Saul
dc.contributor.author	Pucuhuayla-Revatta, Félix
dc.contributor.author	Zapata-Paulini, Joselyn
dc.contributor.author	Cabanillas-Carbonell, Michael
dc.date.accessioned	2023-11-30T16:01:47Z
dc.date.available	2023-11-30T16:01:47Z
dc.date.issued	2023
dc.identifier.uri	https://hdl.handle.net/20.500.13067/2829
dc.description.abstract	This work aims to discover topics in a text corpus and to classify the most relevant terms for each discovered topic. The process was performed in four steps: first, document extraction and data processing; second, labeling and training of the data; third, labeling of the unseen data; and fourth, evaluation of the model performance. For processing, a total of 10,322 "curriculum" documents related to data science were collected from the web during 2018-2022. The latent Dirichlet allocation (LDA) model was used to analyze and structure the topics. After processing, 12 topics were generated, which allowed ranking the most relevant terms to identify the skills of each candidate. This work concludes that candidates interested in data science must have technical skills: mastery of structured query language (SQL), of programming languages such as R, Python, and Java, and of data management, among other tools associated with the technology.	es_PE
dc.format	application/pdf	es_PE
dc.language.iso	eng	es_PE
dc.publisher	Indonesian Journal of Electrical Engineering and Computer Science	es_PE
dc.rights	info:eu-repo/semantics/openAccess	es_PE
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/	es_PE
dc.subject	Classify	es_PE
dc.subject	Discovering	es_PE
dc.subject	Latent dirichlet allocation	es_PE
dc.subject	Text corpus	es_PE
dc.subject	Topics	es_PE
dc.title	Search and classify topics in a corpus of text using the latent dirichlet allocation model	es_PE
dc.type	info:eu-repo/semantics/article	es_PE
dc.identifier.doi	https://doi.org/10.11591/ijeecs.v30.i1.pp246-256
dc.subject.ocde	https://purl.org/pe-repo/ocde/ford#2.02.04	es_PE
dc.source.volume	30	es_PE
dc.source.issue	1	es_PE
dc.source.beginpage	246	es_PE
dc.source.endpage	256	es_PE
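
The abstract describes fitting an LDA model and ranking the most relevant terms per topic. The following is a minimal sketch of that idea, not the authors' implementation: it assumes collapsed Gibbs sampling as the inference method (the record does not say which was used), and the function names `fit_lda` and `top_terms`, the hyperparameters `alpha` and `beta`, and the toy corpus are all hypothetical. The article reports 12 topics over 10,322 documents; the sketch uses 2 topics over 4 short documents purely for illustration.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA with collapsed Gibbs sampling (an assumed inference method).

    docs is a list of token lists. Returns (doc_topic, topic_word), where
    doc_topic[d][k] counts tokens of document d assigned to topic k and
    topic_word[k][w] counts assignments of word w to topic k.
    """
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    topics = range(num_topics)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][word] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in topics
                ]
                # Resample the topic and restore the counts.
                k = rng.choices(topics, weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][word] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


def top_terms(topic_word, n):
    """Rank the n most frequently assigned words in each topic."""
    return [
        [word for word, count in counts.most_common(n) if count > 0]
        for counts in topic_word
    ]


# Hypothetical toy corpus standing in for the tokenized documents.
corpus = [
    ["sql", "database", "query", "sql", "data"],
    ["python", "programming", "java", "python", "r"],
    ["data", "database", "sql", "management"],
    ["java", "r", "programming", "python"],
]
doc_topic, topic_word = fit_lda(corpus, num_topics=2)
for k, terms in enumerate(top_terms(topic_word, 3)):
    print(f"topic {k}: {terms}")
```

In practice a library implementation, with proper tokenization and stop-word removal, would likely replace this hand-written sampler; the sketch only shows how per-topic term rankings fall out of the fitted counts.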

Files in this item

Name: 6_2023.pdf
Size: 631.1Kb
Format: application/pdf
Description: Article

This item appears in the following Collection(s)

Ingeniería de Sistemas [332]

Except where otherwise noted, this item's license is described as info:eu-repo/semantics/openAccess