The challenges of German archival document categorization on insufficient labeled data

Danilo Dessì
2020-01-01

Abstract

Document exploration in archives is often challenging due to the lack of organization into topic-based categories. Moreover, archival records provide only short texts, which are often insufficient for capturing their semantics. This paper proposes and explores a dataless categorization approach that uses word embeddings and TF-IDF to categorize archival documents. Additionally, it introduces a visual approach built on top of the word embeddings to enhance data exploration. Preliminary results suggest that current vector representations alone do not provide enough external knowledge to solve this task.
2020
Cultural Heritage; Dataless Categorization; Document Exploration; Text Categorization
Files in This Item:
2020 - The Challenges of German Archival Document Categorization on Insufficient Labeled Data.pdf

open access

Type: publisher's version
Size: 259.07 kB
Format: Adobe PDF

