Full metadata

Title

The Giles Ecosystem – Storage, Text Extraction, and OCR of Documents

Description

In the digital humanities, there is a constant need to turn images and PDF files into plain text to apply analyses such as topic modelling, named entity recognition, and other techniques. However, although there exist different solutions to extract text embedded in PDF files or run OCR on images, they typically require additional training (for example, scholars have to learn how to use the command line) or are difficult to automate without programming skills. The Giles Ecosystem is a distributed system based on Apache Kafka that allows users to upload documents for text and image extraction. The system components are implemented using Java and the Spring Framework and are available under an Open Source license on GitHub (https://github.com/diging/).

Date Created

2017-09-28

Contributors

Lessios-Damerow, Julia (Contributor)
Peirson, Erick (Contributor)
Laubichler, Manfred (Contributor)
ASU-SFI Center for Biosocial Complex Systems (Contributor)

Resource Type

Text

Extent

5 pages

Language

eng

Copyright Statement

In Copyright

Reuse Permissions

Attribution

Primary Member of

ASU Scholarship Showcase

Identifier

Digital object identifier: 10.5334/jors.164

International standard serial number

2049-9647

Peer-reviewed

No

Open Access

No

Series

Journal of Open Research Software

Handle

https://hdl.handle.net/2286/R.I.46475

Preferred Citation

Damerow, J., Peirson, B. R., & Laubichler, M. D. (2017). The Giles Ecosystem – Storage, Text Extraction, and OCR of Documents. Journal of Open Research Software, 5. doi:10.5334/jors.164

Level of coding

minimal

Cataloging Standards

asu1

Note

The final version of this article, as published in the Journal of Open Research Software, can be viewed online at: https://openresearchsoftware.metajnl.com/articles/10.5334/jors.164/

System Created

2018-02-14 03:57:30

System Modified

2021-06-17 01:01:05
