CATCH 2020 is a medium-scale research infrastructure project funded by the Flemish Research Foundation (FWO). By the end of 2020, it aims to develop an API that builds on the existing Transkribus platform to facilitate the computer-assisted transcription of more complex handwritten documents.
Rather than producing flat transcripts of digital facsimile images (the default output of OCR and HTR engines), CATCH 2020 aims to produce structured texts by providing tools to add (1) textual and (2) linguistic dimensions to the transcription – thus combining the state of the art in textual scholarship with the state of the art in computational linguistics.
The infrastructure will:
- Provide tools for (semi-)automatically identifying textual features in the document (i.e. layout analysis), such as additions, deletions, or structural elements like paragraphs or stanzas (1).
- Use linguistic and stylistic information to improve Transkribus’s transcription algorithm (2).
This combination will enable the automatic generation of high-quality, structured digitized text that may serve as a sound basis for further literary, linguistic and historical research. Led by ACDC (the Antwerp Centre for Digital humanities and literary Criticism), this multidisciplinary project is being developed in collaboration with CLiPS, GaP, ISLN, and the CSG, and with the external partners Transkribus and the Antwerp-based Flemish literary archive Letterenhuis.