Description
In the standard pipeline for machine learning model development, several design decisions are made largely based on trial and error. Take the classification problem as an example. The starting point for classifier design is a dataset with samples from the

In the standard pipeline for machine learning model development, several design decisions are made largely based on trial and error. Take the classification problem as an example. The starting point for classifier design is a dataset with samples from the classes of interest. From this, the algorithm developer must decide which features to extract, which hypothesis class to condition on, which hyperparameters to select, and how to train the model. The design process is iterative with the developer trying different classifiers, feature sets, and hyper-parameters and using cross-validation to pick the model with the lowest error. As there are no guidelines for when to stop searching, developers can continue "optimizing" the model to the point where they begin to "fit to the dataset". These problems are amplified in the active learning setting, where the initial dataset may be unlabeled and label acquisition is costly. The aim in this dissertation is to develop algorithms that provide ML developers with additional information about the complexity of the underlying problem to guide downstream model development. I introduce the concept of "meta-features" - features extracted from a dataset that characterize the complexity of the underlying data generating process. In the context of classification, the complexity of the problem can be characterized by understanding two complementary meta-features: (a) the amount of overlap between classes, and (b) the geometry/topology of the decision boundary. Across three complementary works, I present a series of estimators for the meta-features that characterize overlap and geometry/topology of the decision boundary, and demonstrate how they can be used in algorithm development.
Reuse Permissions
  • Downloads
    PDF (7.2 MB)

    Details

    Title
    • Modeling and Exploiting the Structure of Data via Meta-Features for Robust and Efficient Machine Learning
    Contributors
    Date Created
    2022
    Subjects
    Resource Type
  • Text
  • Collections this item is in
    Note
    • Partial requirement for: Ph.D., Arizona State University, 2022
    • Field of study: Computer Engineering

    Machine-readable links