Full metadata

Title

Semantic feature extraction for narrative analysis

Description

A story is defined as "an actor(s) taking action(s) that culminates in a resolution(s)". I present novel sets of features that facilitate story detection in text via supervised classification and reveal different forms within stories via unsupervised clustering. First, to develop a story classifier, I investigate the utility of a new set of semantic features compared to standard keyword features combined with statistical features, such as the density of part-of-speech (POS) tags and named entities. The proposed semantic features are based on triplets that can be extracted using a shallow parser. Experimental results show that a model of memory-based semantic linguistic features alongside statistical features achieves better accuracy. Next, I further improve the performance of story detection with a novel algorithm that aggregates the triplets, producing generalized concepts and relations. A major challenge in automated text analysis is that different words are used for related concepts. Analyzing text at the surface level would treat related concepts (i.e., actors, actions, targets, and victims) as different objects, potentially missing common narrative patterns. The algorithm clusters triplets into generalized concepts by utilizing syntactic criteria based on common contexts and semantic corpus-based statistical criteria based on "contextual synonyms". The generalized-concept representation of text (1) overcomes surface-level differences (which arise when different keywords are used for related concepts) without drift, (2) leads to a higher-level semantic network representation of related stories, and (3) yields a significant (36%) boost in performance for the story detection task when used as features. Finally, I implement co-clustering based on generalized concepts/relations to automatically detect story forms. Overlapping generalized concepts and relationships correspond to archetypes/targets and actions that characterize story forms.
I perform co-clustering of stories using standard unigrams/bigrams and generalized concepts. I show that the residual error of factorization with concept-based features is significantly lower than the error with standard keyword-based features. I also present qualitative evaluations by a subject matter expert, which suggest that concept-based features yield more coherent, distinctive, and interesting story forms than those produced with standard keyword-based features.
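As a rough illustration of the idea of grouping triplet arguments by shared context, the following minimal sketch merges actors that appear with the same (action, target) pairs. It is not the dissertation's algorithm: the function name `generalize_actors`, the Jaccard overlap threshold, the union-find merging, and the example triplets are all assumptions made for illustration, and the corpus-based "contextual synonym" criteria are not modeled here.

```python
from collections import defaultdict
from itertools import combinations


def generalize_actors(triplets, min_overlap=0.5):
    """Group actors whose (action, target) contexts overlap.

    Hypothetical helper: two actors are merged when the Jaccard
    similarity of their context sets is at least `min_overlap`.
    """
    # Collect the set of (action, target) contexts seen with each actor.
    contexts = defaultdict(set)
    for actor, action, target in triplets:
        contexts[actor].add((action, target))

    # Union-find structure used to merge actors transitively.
    parent = {actor: actor for actor in contexts}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(sorted(contexts), 2):
        shared = contexts[a] & contexts[b]
        union = contexts[a] | contexts[b]
        if len(shared) / len(union) >= min_overlap:
            parent[find(a)] = find(b)

    # Gather the merged groups as sorted lists for stable output.
    groups = defaultdict(set)
    for actor in contexts:
        groups[find(actor)].add(actor)
    return sorted(sorted(group) for group in groups.values())


# Invented example triplets of the form (actor, action, target).
triplets = [
    ("rebels", "attack", "convoy"),
    ("rebels", "seize", "village"),
    ("insurgents", "attack", "convoy"),
    ("insurgents", "seize", "village"),
    ("villagers", "flee", "village"),
]
concepts = generalize_actors(triplets)
print(concepts)
```

In this toy input, "rebels" and "insurgents" share every context and fall into one generalized concept, while "villagers" stays separate. A surface-level keyword representation would treat all three as unrelated tokens.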

Date Created

2016

Contributors

Ceran, Saadet Betul (Author)
Davulcu, Hasan (Thesis advisor)
Corman, Steven R. (Committee member)
Shakarian, Paulo (Committee member)
Ye, Jieping (Committee member)
Arizona State University (Publisher)

Resource Type

Text

Genre

Doctoral Dissertation

Academic theses

Extent

vi, 66 pages : illustrations (some color)

Language

eng

Copyright Statement

In Copyright

Primary Member of

ASU Electronic Theses and Dissertations

Peer-reviewed

No

Open Access

No

Handle

https://hdl.handle.net/2286/R.I.40243

Embargo Release Date

Wed, 08/01/2018 - 02:34

Statement of Responsibility

by Saadet Betul Ceran

Description Source

Viewed on November 8, 2016

Level of coding

full

Note

thesis

Partial requirement for: Ph.D., Arizona State University, 2016

bibliography

Includes bibliographical references (pages 62-66)

Field of study: Computer science

System Created

2016-10-12 02:17:32

System Modified

2021-08-30 01:21:36