Stopwords identification by means of characteristic and discriminant analysis

Armano, Giuliano; Fanni, Francesca; Giuliani, Alessandro
2015-01-01

Abstract

Stopwords are non-significant terms that occur frequently in documents and carry little meaning. They should be removed, like noise. Traditionally, two approaches to building a stoplist have been used: the first takes the most frequent terms of a language (e.g., an English stoplist), and the second takes the most frequent terms in a document collection. In several tasks, e.g., text classification and clustering, documents are typically grouped into categories. We propose a novel approach aimed at automatically identifying specific stopwords for each category. The proposal relies on two unbiased metrics for analyzing the informative content of each term: one measures its discriminant capability and the other its characteristic capability. For each term, the former is expected to be high when the term distinguishes a category from the others, whereas the latter is expected to be high when the term is frequent and common across all categories. A preliminary study and experiments have been performed to support this insight. Results confirm that, for each domain, the metrics easily identify specific stoplists, which include both classical and category-dependent stopwords.
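The abstract does not give formulas for the two metrics, so the following is only a minimal sketch under stated assumptions. It assumes that, for a category C, the discriminant capability of a term is the difference between its document frequency inside C and outside C, and that the characteristic capability is the sum of those two frequencies minus one. The function names, the thresholds, and the toy corpus are hypothetical and serve only to illustrate how category-specific stopwords might be selected.

```python
from collections import Counter


def term_metrics(docs, category):
    """Return {term: (delta, phi)} for one category.

    docs is a list of (label, terms) pairs. ASSUMPTION (not from the abstract):
      delta = P(term | category) - P(term | other categories)
      phi   = P(term | category) + P(term | other categories) - 1
    where each P is the fraction of documents containing the term.
    """
    in_total = sum(1 for label, _ in docs if label == category)
    out_total = len(docs) - in_total
    if in_total == 0 or out_total == 0:
        raise ValueError("need documents both inside and outside the category")

    in_df, out_df = Counter(), Counter()
    for label, terms in docs:
        # Count each term once per document (document frequency).
        (in_df if label == category else out_df).update(set(terms))

    metrics = {}
    for term in set(in_df) | set(out_df):
        p_in = in_df[term] / in_total
        p_out = out_df[term] / out_total
        metrics[term] = (p_in - p_out, p_in + p_out - 1.0)
    return metrics


def candidate_stopwords(metrics, min_phi=0.5, max_abs_delta=0.25):
    """Terms that are common everywhere (high phi) and do not separate
    the category from the rest (delta near zero). Thresholds are hypothetical."""
    return sorted(
        term
        for term, (delta, phi) in metrics.items()
        if phi >= min_phi and abs(delta) <= max_abs_delta
    )


# Toy corpus, invented for illustration only.
docs = [
    ("sport", ["the", "match", "was", "won"]),
    ("sport", ["the", "team", "won", "the", "match"]),
    ("finance", ["the", "market", "was", "up"]),
    ("finance", ["the", "bank", "cut", "rates"]),
]
sport_metrics = term_metrics(docs, "sport")
print(candidate_stopwords(sport_metrics))
```

Under these assumptions, a term such as "the", which appears in every document of every category, gets a discriminant value of 0 and a characteristic value of 1, while a term such as "match", which appears only in the sport documents, gets a discriminant value of 1 and a characteristic value of 0.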
2015
English
ICAART 2015 - 7th International Conference on Agents and Artificial Intelligence, Proceedings
9789897580741
9897580743
SciTePress
2
353
360
8
7th International Conference on Agents and Artificial Intelligence, ICAART 2015
Contribution
Anonymous reviewers
10-12 January 2015
Lisbon
international
scientific
Characteristic capability; Discriminant capability; Stopwords; Text classification; Artificial Intelligence; Software
no
4 Contribution in Conference Proceedings (Proceeding)::4.1 Contribution in conference proceedings
273
3
reserved
info:eu-repo/semantics/conferencePaper
Files in this item:
File: 2015-ICAART-armano.pdf
Access: archive managers only
Type: publisher's version
Size: 341.98 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
