The challenges of German archival document categorization on insufficient labeled data

Danilo Dessì
2020-01-01

Abstract

Document exploration in archives is often challenging because records are not organized into topic-based categories. Moreover, archival records provide only short texts, which are often insufficient for capturing their semantics. This paper proposes and explores a dataless categorization approach that uses word embeddings and TF-IDF to categorize archival documents. Additionally, it introduces a visual approach built on top of the word embeddings to enhance data exploration. Preliminary results suggest that current vector representations alone do not provide enough external knowledge to solve this task.
2020
English
WHiSe 2020 Workshop on Humanities in the Semantic Web 2020
CEUR-WS
2695
15
20
6
3rd Workshop on Humanities in the Semantic Web, WHiSe 2020
Anonymous experts
2 June 2020
Heraklion, Greece (Virtual)
scientific
Cultural Heritage; Dataless Categorization; Document Exploration; Text Categorization
4 Contribution in Conference Proceedings (Proceeding)::4.1 Contribution in conference proceedings
Hoppe, Fabian; Tietz, Tabea; Dessì, Danilo; Meyer, Nils; Sprau, Mirjam; Alam, Mehwish; Sack, Harald
273
7
4.1 Contribution in conference proceedings
open
info:eu-repo/semantics/conferencePaper
Files in this item:
File Size Format
2020 - The Challenges of German Archival Document Categorization on Insufficient Labeled Data.pdf

open access

Type: publisher's version
Size 259.07 kB
Format Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
