Full metadata
Title
The Giles Ecosystem – Storage, Text Extraction, and OCR of Documents
Description
In the digital humanities, there is a constant need to turn images and PDF files into plain text so that analyses such as topic modelling and named entity recognition can be applied. Although various tools can extract text embedded in PDF files or run OCR on images, they typically require additional training (for example, scholars have to learn how to use the command line) or are difficult to automate without programming skills. The Giles Ecosystem is a distributed system based on Apache Kafka that allows users to upload documents for text and image extraction. The system components are implemented in Java with the Spring Framework and are available under an open source license on GitHub (https://github.com/diging/).
Date Created
2017-09-28
Contributors
- Lessios-Damerow, Julia (Contributor)
- Peirson, Erick (Contributor)
- Laubichler, Manfred (Contributor)
- ASU-SFI Center for Biosocial Complex Systems (Contributor)
Resource Type
Extent
5 pages
Language
eng
Copyright Statement
In Copyright
Primary Member of
Identifier
Digital object identifier: 10.5334/jors.164
Identifier Type
International standard serial number
Identifier Value
2049-9647
Peer-reviewed
No
Open Access
No
Series
Journal of Open Research Software
Handle
https://hdl.handle.net/2286/R.I.46475
Preferred Citation
Damerow, J., Peirson, B. R., & Laubichler, M. D. (2017). The Giles Ecosystem – Storage, Text Extraction, and OCR of Documents. Journal of Open Research Software, 5. doi:10.5334/jors.164
Level of coding
minimal
Cataloging Standards
Note
The final version of this article, as published in the Journal of Open Research Software, can be viewed online at: https://openresearchsoftware.metajnl.com/articles/10.5334/jors.164/
System Created
- 2018-02-14 03:57:30
System Modified
- 2021-06-17 01:01:05