LanSAR – Language-commanded Scene-aware Action Response

Robot motion and control remains a complex problem both in general and in the field of machine learning (ML). Without ML approaches, robot controllers are typically designed manually, which can take considerable time, generally requires accounting for a range of edge cases, and often produces models highly constrained to specific tasks. ML can decrease the time it takes to create a model while simultaneously allowing it to operate on a broader range of tasks. Using neural networks to learn from demonstration is, in particular, an approach with growing popularity due to its potential to quickly fit the parameters of a model to mimic training data. Many such neural networks, especially transformer-based architectures, act more as planners, taking in an initial context and then generating a sequence from that context one step at a time. Others hybridize the approach, predicting a latent plan and conditioning immediate actions on that plan. Such approaches may limit a model’s ability to interact with a dynamic environment, because the model must replan to fully update its understanding of the environmental context. In this thesis, Language-commanded Scene-aware Action Response (LanSAR) is proposed as a reactive transformer-based neural network that makes immediate decisions based on previous actions and environmental changes. Its actions are further conditioned on a language command, which serves as a control mechanism while also narrowing the distribution of possible actions around this command. It is shown that LanSAR successfully learns a strong representation of multimodal visual and spatial input, and learns reasonable motions in relation to most language commands. It is also shown that LanSAR can struggle with both the accuracy of motions and understanding the specific semantics of language commands.
Learning Predictive Models for Assisted Human Biomechanics

This dissertation explores the use of artificial intelligence and machine learning techniques for the development of controllers for fully-powered robotic prosthetics. The aim of the research is to enable prosthetics to predict future states and control biomechanical properties in both linear and nonlinear fashions, with a particular focus on ergonomics. The research is motivated by the need to provide amputees with prosthetic devices that not only replicate the functionality of the missing limb, but also offer a high level of comfort and usability. Traditional prosthetic devices lack the sophistication to adjust to a user’s movement patterns and can cause discomfort and pain over time. The proposed solution involves the development of machine learning-based controllers that can learn from user movements and adjust the prosthetic device’s movements accordingly. The research involves a combination of simulation and real-world testing to evaluate the effectiveness of the proposed approach. The simulation involves the creation of a model of the prosthetic device and the use of machine learning algorithms to train controllers that predict future states and control biomechanical properties. The real-world testing involves the use of human subjects wearing the prosthetic device to evaluate its performance and usability. The research focuses on two main areas: the prediction of future states and the control of biomechanical properties. The prediction of future states involves the development of machine learning algorithms that can analyze a user’s movements and predict the next movements with a high degree of accuracy. The control of biomechanical properties involves the development of algorithms that can adjust the prosthetic device’s movements to ensure maximum comfort and usability for the user.
The results of the research show that the use of artificial intelligence and machine learning techniques can significantly improve the performance and usability of prosthetic devices. The machine learning-based controllers developed in this research are capable of predicting future states and adjusting the prosthetic device’s movements in real time, leading to a significant improvement in ergonomics and usability. Overall, this dissertation provides a comprehensive analysis of the use of artificial intelligence and machine learning techniques for the development of controllers for fully-powered robotic prosthetics.
Towards Reliable Semantic Vision

Models that learn from data are widely and rapidly being deployed today for real-world use, and have become an integral and embedded part of human lives. While these technological advances are exciting and impactful, such data-driven computer vision systems often fail in inscrutable ways. This dissertation seeks to study and improve the reliability of machine learning models from several perspectives, including the development of robust training algorithms to mitigate the risks of such failures, construction of new datasets that provide a new perspective on capabilities of vision models, and the design of evaluation metrics for re-calibrating the perception of performance improvements. I will first address distribution shift in image classification with the following contributions: (1) two methods for improving the robustness of image classifiers to distribution shift by leveraging the classifier's failures into an adversarial data transformation pipeline guided by domain knowledge, (2) an interpolation-based technique for flagging out-of-distribution samples, and (3) an intriguing trade-off between distributional and adversarial robustness resulting from data modification strategies. I will then explore reliability considerations for semantic vision models that learn from both visual and natural language data; I will discuss how logical and semantic sentence transformations affect the performance of vision-language models and my contributions towards developing knowledge-guided learning algorithms to mitigate these failures. Finally, I will describe the effort towards building and evaluating complex reasoning capabilities of vision-language models towards the long-term goal of robust and reliable computer vision models that can communicate, collaborate, and reason with humans.
Generating Natural Language Descriptions from Multimodal Data Traces of Robot Behavior

Natural language plays a crucial role in human-robot interaction, as it is the common ground on which human beings and robots can communicate and understand each other. However, most work in natural language and robotics focuses on generating robot actions from a natural language command, which is a unidirectional form of communication. This work focuses on the other direction of communication: the approach allows a robot to describe its actions from sampled images and joint sequences from the robot task. The importance of this work is that it utilizes multiple modalities, namely the start and end images from the robot task environment and the joint trajectories of the robot arms. Fusing different modalities is not just about combining the data but about knowing what information to extract from which data source, so that the language description represents the state of the manipulator and of the environment in which it is performing the task. Using experimental results from various simulated robot environments, this research demonstrates that utilizing multiple modalities improves the accuracy of the natural language description, and that efficiently fusing the modalities is crucial for generating such descriptions because it makes the most of the various data sources.
Multimodal Robot Learning for Grasping and Manipulation

Enabling robots to physically engage with their environment in a safe and efficient manner is an essential step towards human-robot interaction. To date, robots usually operate as pre-programmed workers that blindly execute tasks in highly structured environments crafted by skilled engineers. Changing the robots’ behavior to cover new duties or handle variability is an expensive, complex, and time-consuming process. However, with the advent of more complex sensors and algorithms, overcoming these limitations comes within reach. This work proposes innovations in artificial intelligence, language understanding, and multimodal integration to enable next-generation grasping and manipulation capabilities in autonomous robots. The underlying thesis is that multimodal observations and instructions can drastically expand the responsiveness and dexterity of robot manipulators. Natural language, in particular, can be used to enable intuitive, bidirectional communication between a human user and the machine. To this end, this work presents a system that learns context-aware robot control policies from multimodal human demonstrations. Among the main contributions presented are techniques for (a) collecting demonstrations in an efficient and intuitive fashion, (b) leveraging physical contact with the environment and objects, (c) incorporating natural language to understand context, and (d) generating robust robot control policies. The presented approach and systems are evaluated in multiple grasping and manipulation settings ranging from dexterous manipulation to pick-and-place, as well as contact-rich bimanual insertion tasks. Moreover, the usability of these innovations, especially when utilizing human task demonstrations and communication interfaces, is evaluated in several human-subject studies.
Probabilistic Imitation Learning for Spatiotemporal Human-Robot Interaction

Imitation learning is a promising methodology for teaching robots how to physically interact and collaborate with human partners. However, successful interaction requires complex coordination in time and space, i.e., knowing what to do as well as when to do it. This dissertation introduces Bayesian Interaction Primitives, a probabilistic imitation learning framework which establishes a conceptual and theoretical relationship between human-robot interaction (HRI) and simultaneous localization and mapping. In particular, it is established that HRI can be viewed through the lens of recursive filtering in time and space. In turn, this relationship allows one to leverage techniques from an existing, mature field and develop a powerful new formulation which enables multimodal spatiotemporal inference in collaborative settings involving two or more agents. Through the development of exact and approximate variations of this method, it is shown in this work that it is possible to learn complex real-world interactions in a wide variety of settings, including tasks such as handshaking, cooperative manipulation, catching, hugging, and more.
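
As a rough illustration of what "recursive filtering" means in the abstract above, the following minimal sketch runs a scalar Kalman filter that refines a belief about one hidden quantity as noisy observations arrive. It is not the Bayesian Interaction Primitives formulation; the function name `kalman_step`, the noise variances, and the observation values are all hypothetical choices made for this example.

```python
def kalman_step(mean, var, z, process_var=0.01, meas_var=0.25):
    """One predict/update cycle of a scalar Kalman filter (illustrative only)."""
    # Predict: the state is assumed static, so only the uncertainty grows.
    var_pred = var + process_var
    # Update: blend the prediction with the new observation z.
    gain = var_pred / (var_pred + meas_var)
    new_mean = mean + gain * (z - mean)
    new_var = (1.0 - gain) * var_pred
    return new_mean, new_var

# Hypothetical noisy observations of a quantity whose true value is near 1.0.
mean, var = 0.0, 1.0  # vague initial belief
for z in [0.9, 1.1, 1.0, 0.95, 1.05]:
    mean, var = kalman_step(mean, var, z)

print(f"estimate={mean:.3f}, variance={var:.3f}")
```

Each new observation pulls the estimate towards the data and shrinks the variance, which is the recursive belief update that the abstract relates to inference over an ongoing interaction.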
Safe and Robust Cooperative Algorithm for Connected Autonomous Vehicles

Autonomous Vehicles (AVs) have the potential to significantly transform transportation. AVs are expected to make transportation safer by avoiding accidents that happen due to human errors. When AVs become connected, they can exchange information with the infrastructure or other Connected Autonomous Vehicles (CAVs) to efficiently plan their future motion and thereby increase road throughput and reduce energy consumption. Cooperative algorithms for CAVs will not be deployed in real life unless they are proved to be safe, robust, and resilient to different failure models. Since intersections are crucial areas where most accidents happen, this dissertation first focuses on making existing intersection management algorithms safe and resilient against network and computation time, bounded model mismatches and external disturbances, and the existence of a rogue vehicle. Then, a generic algorithm for conflict resolution and cooperation of CAVs is proposed that ensures the safety of vehicles even when other vehicles suddenly change their plan. The proposed approach can also detect deadlock situations among CAVs and resolve them through a negotiation process. A testbed consisting of 1/10th scale model CAVs is built to evaluate the proposed algorithms. In addition, a simulator is developed to perform tests at a large scale. Results from the conducted experiments indicate the robustness and resilience of the proposed approaches.
Optimization Based Verification and Synthesis for Safe Autonomy

Autonomous systems should satisfy a set of requirements that guarantee their safety, efficiency, and reliability when working under uncertain circumstances. These requirements can have financial or legal implications, or they can describe what is assigned to autonomous systems. As a result, the system controller needs to be designed to comply with these potentially complicated requirements, and the closed-loop system needs to be tested and verified against them. However, as the complexity of the system and its requirements increases, designing a requirement-based controller for the system and analyzing the closed-loop system against the requirements becomes very challenging. In this case, existing design and test methodologies based on trial and error would fail, and hence disciplined scientific approaches should be considered. To address some of these challenges, in this dissertation, I present different methods that facilitate efficient testing and control design based on requirements: (1) gradient-based methods for improved optimization-based testing, (2) requirement-based learning for the design of neural-network controllers, and (3) methods based on barrier functions for designing control inputs that ensure the satisfaction of safety constraints.
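
To make the barrier-function idea in the abstract above concrete, here is a minimal sketch for a toy system, not the dissertation's formulation. It assumes a one-dimensional single integrator x' = u with the safe set x <= x_max, the barrier h(x) = x_max - x, and the standard condition dh/dt >= -gamma * h(x), which for this system reduces to u <= gamma * (x_max - x). The names `safe_input` and `simulate` and all numeric values are hypothetical.

```python
def safe_input(x, u_nominal, x_max=1.0, gamma=2.0):
    """Clip the nominal input so the barrier condition holds (toy example)."""
    h = x_max - x                      # barrier value: non-negative inside the safe set
    return min(u_nominal, gamma * h)   # enforce u <= gamma * h(x)

def simulate(x0=0.0, u_nominal=1.5, dt=0.01, steps=500):
    """Forward-Euler rollout of x' = u under the filtered input."""
    x = x0
    trajectory = [x]
    for _ in range(steps):
        x += dt * safe_input(x, u_nominal)
        trajectory.append(x)
    return trajectory

trajectory = simulate()
print(f"max state reached: {max(trajectory):.4f}")
```

The nominal input would drive the state past x_max, but the filtered input slows the system as it nears the boundary, so the state approaches x_max without crossing it. With this discretization the guarantee relies on dt * gamma <= 1.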
Learning Policies for Model-Based Reinforcement Learning Using Distributed Reward Formulation

This work explores combining state-of-the-art model-based reinforcement learning (MBRL) algorithms, which focus on learning complex policies over large state spaces, with a distributional perspective on reward from reinforcement learning (RL). Distributional RL models the full probability distribution of the return, as opposed to the classic RL formulation, which estimates only the expectation of that distribution. These probabilistic reward formulations help the agent choose highly risk-averse actions, which in turn makes learning more stable. To evaluate this idea, I experiment in simulation on complex, high-dimensional environments subject to different noise conditions.
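
The following minimal sketch illustrates why a return distribution can support risk-averse action selection where an expected value cannot. It is not the method used in this work: the risk measure (conditional value at risk, CVaR), the function names `cvar` and `pick_action`, and the sampled returns are all assumptions made for illustration.

```python
from statistics import mean

def cvar(returns, alpha=0.25):
    """Mean of the worst alpha-fraction of sampled returns (lower tail)."""
    ordered = sorted(returns)
    k = max(1, int(len(ordered) * alpha))
    return mean(ordered[:k])

def pick_action(return_samples, alpha=0.25):
    """Choose the action whose lower-tail return is largest."""
    return max(return_samples, key=lambda a: cvar(return_samples[a], alpha))

# Hypothetical return samples for two actions.
samples = {
    "risky": [-10.0, -8.0, 12.0, 14.0, 15.0, 16.0, 17.0, 18.0],
    "safe": [3.0, 4.0, 4.0, 5.0, 5.0, 6.0, 6.0, 7.0],
}

expected_choice = max(samples, key=lambda a: mean(samples[a]))
risk_averse_choice = pick_action(samples)
print(expected_choice, risk_averse_choice)
```

Here the expectation alone prefers the "risky" action, while the tail-sensitive criterion prefers the "safe" one. That distinction is only available when the agent keeps the distribution of returns rather than a single expected value.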
Augmenting Academic Research Search And Reading With Richer Context

The volume of scientific research has grown at an exponential rate over the past 100 years. With the advent of the internet and ubiquitous access to the web, academic research search engines such as Google Scholar, Microsoft Academic, etc., have become the go-to platforms for systematic reviews and search. Although many academic search engines host a large amount of content, they provide minimal context about where the search terms matched. Many of these search engines also fail to provide additional tools that can help enhance a researcher’s understanding of research content outside their respective websites. An example of such a tool is a browser extension/plugin that surfaces context-relevant information about a research article while the user reads it. This dissertation discusses a solution that uses more of the intrinsic characteristics of research documents, such as the structure of the document, the tables in the document, and the keywords associated with the document, to improve search capabilities and augment the information a researcher may read. The prototype solution, named Sci-Genie, is a search engine over scientific articles from Computer Science ArXiv. Sci-Genie parses research papers and indexes research documents’ structure to provide context-relevant information about the matched search fragments. The same search engine also powers a browser extension to augment the information about a research article the user may be reading. The browser extension augments the user’s interface with information about tables from the cited papers, other papers by the same authors, and even the citations to and from the current article. The browser extension is further powered by access endpoints that leverage a machine learning model to filter tables comparing various entities. The dissertation further discusses these machine learning models and some baselines that help classify whether or not a table compares various entities.
The dissertation concludes by discussing the current shortcomings of Sci-Genie and possible future research directions based on lessons learned from building it.
