Examining the Usage of New NLP Techniques to Process Raw Text Data Entries
In 2018, Google researchers published the BERT (Bidirectional Encoder Representations from Transformers) model, which has since served as a starting point for hundreds of NLP (Natural Language Processing) experiments and derivative models. BERT was pre-trained on masked-language modelling and next-sentence prediction, but its capabilities extend to more common NLP tasks, such as language inference and text classification. Naralytics is a company that seeks to use natural language to sort the users who produce text into multiple categories, a modified form of text classification. However, the texts that Naralytics seeks to draw from exceed the maximum input length of 512 tokens that BERT supports. This report therefore reviews the research on several BERT derivatives that address this limitation and then implements a solution that resolves the concerns attached to this kind of model.
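To make the token-limit issue concrete, here is a minimal sketch, assuming the Hugging Face transformers library and PyTorch, of one common workaround: splitting a long document into overlapping windows that each fit within BERT's 512-token budget, classifying every window, and averaging the per-window logits. The model name, the four user categories, the window size, and the stride are illustrative assumptions, not the thesis's actual configuration.

```python
# Hypothetical sketch, not the thesis's implementation: sliding-window
# classification of a document longer than BERT's 512-token maximum.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=4,  # assumed number of user categories
)
model.eval()

def classify_long_text(text: str, window: int = 510, stride: int = 255) -> int:
    """Classify a document of arbitrary length with a 512-token BERT by
    averaging logits over overlapping windows of the token sequence."""
    # Tokenize without special tokens so the raw sequence can be sliced;
    # 510 content tokens + [CLS] + [SEP] = BERT's 512-token maximum.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    logits = []
    for start in range(0, max(len(ids), 1), stride):
        chunk = ([tokenizer.cls_token_id]
                 + ids[start:start + window]
                 + [tokenizer.sep_token_id])
        with torch.no_grad():
            out = model(input_ids=torch.tensor([chunk]))
        logits.append(out.logits)
        if start + window >= len(ids):  # final window reached the end
            break
    # Average the per-window logits and return the most likely category.
    return torch.stack(logits).mean(dim=0).argmax(dim=-1).item()
```

In practice the classification head loaded here is untrained and would first be fine-tuned on labelled examples. Long-input BERT derivatives (Longformer and BigBird are well-known examples, though the report's selection may differ) instead modify the attention mechanism so that longer sequences can be processed directly, avoiding the chunking step entirely.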
- Author: Ngo, Nicholas
- Thesis director: Carter, Lynn
- Committee member: Lee, Gyou-Re
- Contributor: Barrett, The Honors College
- Contributor: Computer Science and Engineering Program
- Contributor: Economics Program in CLAS