Model-agnostic Coreset Selection via LLM-based Concept Bottlenecks

Dolby Laboratories, Inc.

Akshay Mehra*, Trisha Mittal*, Subhadra Gopalakrishnan, Joshua Kimball

arXiv
CVPR 2025 Workshop on Visual Concepts

Coreset Selection (CS) aims to identify a subset of the training dataset that achieves model performance comparable to using the entire dataset. Many state-of-the-art CS methods select coresets using scores whose computation requires training the downstream model on the entire dataset first and recording changes in the model's behavior on samples as it trains (training dynamics). These scores are inefficient to compute and hard to interpret, as they do not indicate whether a sample is difficult to learn in general or only for a specific downstream model. Our work addresses these challenges by proposing a score that computes a sample's difficulty using human-understandable textual attributes (concepts) independent of any downstream model. Specifically, we measure the alignment between a sample's visual features and concept bottlenecks, derived via large language models, by training a linear concept bottleneck layer and computing the sample's difficulty score using it. We then use stratified sampling based on this score to generate a coreset of the dataset. Crucially, our score is efficiently computable without training the downstream model on the full dataset even once, leads to high-performing coresets for various downstream models, and is computable even for an unlabeled dataset. Through experiments on CIFAR-10/100, and ImageNet-1K, we show that our coresets outperform random subsets, even at high pruning rates, and achieve model performance comparable to or better than coresets found by training dynamics-based methods.
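
The stratified-sampling step can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration (not the authors' code): it assumes a precomputed per-sample difficulty score and bins samples into quantile strata so that every difficulty level stays represented in the coreset; the function name, stratum count, and keep fraction are placeholders.

```python
# Minimal sketch (not the authors' code): given a per-sample difficulty score,
# build a coreset by stratified sampling so all difficulty levels stay represented.
import numpy as np

def stratified_coreset(scores, keep_fraction, n_strata=10, seed=0):
    """Select coreset indices by binning samples on their difficulty score."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores)
    # Assign each sample to a stratum using score quantiles.
    edges = np.quantile(scores, np.linspace(0.0, 1.0, n_strata + 1))
    strata = np.clip(np.searchsorted(edges, scores, side="right") - 1, 0, n_strata - 1)
    selected = []
    for s in range(n_strata):
        idx = np.where(strata == s)[0]
        if idx.size == 0:
            continue
        k = max(1, int(round(keep_fraction * idx.size)))
        selected.append(rng.choice(idx, size=k, replace=False))
    return np.concatenate(selected)

# Example: keep 10% of 50k samples with synthetic difficulty scores.
coreset_idx = stratified_coreset(np.random.rand(50_000), keep_fraction=0.10)
```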

Analysis of Human Perception in Distinguishing Real and AI-Generated Faces: An Eye-Tracking Based Study

Dolby Laboratories, Inc.

Jin Huang, Subhadra Gopalakrishnan, Trisha Mittal, Jake Zuena, Jaclyn Pytlarz

The 19th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2025)

Recent advances in Artificial Intelligence have led to remarkable improvements in generating realistic human faces. While these advances demonstrate significant progress in generative models, they also raise concerns about the potential misuse of generated images. In this study, we investigate how humans perceive and distinguish between real and fake images. We designed a perceptual experiment using eye-tracking technology to analyze how individuals differentiate real faces from those generated by AI. Our analysis of StyleGAN-3-generated images reveals that participants can distinguish real from fake faces with an average accuracy of 76.80%. Additionally, we found that participants scrutinize images more closely when they suspect an image to be fake. We believe this study offers valuable insights into human perception of AI-generated media.

Assisted Inverse Reinforcement Learning

P. Kamalaruban, R. Devidze, T. Yeo, Trisha Mittal, V. Cevher, A. Singla

NeurIPS 2018 Workshop on Learning by Instruction

We study the problem of inverse reinforcement learning (IRL) with the added twist that the learner is assisted by a helpful teacher. More formally, we tackle the following question: how can a teacher provide an informative sequence of demonstrations to an IRL agent to speed up the learning process? We prove rigorous convergence guarantees for a new iterative teaching algorithm that adaptively chooses demonstrations based on the learner's current performance. Extensive experiments in a car-driving simulator environment show that learning progress can be sped up drastically compared to an uninformative teacher.
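
To convey the flavor of adaptive demonstration selection, here is a hypothetical sketch (not the paper's algorithm or its convergence-guaranteed update): a teacher hands the learner the candidate demonstration whose value the learner's current reward estimate explains worst, assuming linear reward features; all names and shapes are illustrative.

```python
# Illustrative sketch only (not the paper's teaching algorithm).
import numpy as np

def pick_next_demo(demo_features, learner_reward_weights, teacher_reward_weights):
    """Return the index of the demonstration with the largest reward-estimation gap.

    demo_features: (n_demos, d) feature expectations of candidate demonstrations.
    """
    learner_values = demo_features @ learner_reward_weights
    teacher_values = demo_features @ teacher_reward_weights
    return int(np.argmax(np.abs(teacher_values - learner_values)))

# Toy usage with random feature expectations and reward weights.
rng = np.random.default_rng(0)
phi = rng.normal(size=(20, 5))
print(pick_next_demo(phi, rng.normal(size=5), rng.normal(size=5)))
```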

M3ER: Multiplicative Multimodal Emotion Recognition Using Facial, Textual, and Speech Cues

Trisha Mittal, Uttaran Bhattacharya, Rohan Chandra, Aniket Bera, Dinesh Manocha

AAAI 2020

We present M3ER, a learning-based method for emotion recognition from multiple input modalities. Our approach combines cues from multiple co-occurring modalities (such as face, text, and speech) and is also more robust than other methods to sensor noise in any of the individual modalities. M3ER uses a novel, data-driven multiplicative fusion method to combine the modalities, which learns to emphasize the more reliable cues and suppress the others on a per-sample basis. By introducing a check step that uses Canonical Correlation Analysis to differentiate between ineffective and effective modalities, M3ER is robust to sensor noise. M3ER also generates proxy features in place of the ineffectual modalities. We demonstrate the effectiveness of our network through experiments on two benchmark datasets, IEMOCAP and CMU-MOSEI, and report a mean accuracy of 82.7% on IEMOCAP and 89.0% on CMU-MOSEI, an improvement of about 5% over prior work.
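
The multiplicative-fusion idea can be sketched as a loss in which each modality's cross-entropy term is down-weighted when the remaining modalities are already confident. The PyTorch snippet below is a minimal illustration under that assumption, not M3ER's exact formulation; the exponent `beta` and the clamping constants are placeholders.

```python
# Minimal PyTorch sketch of a multiplicative-style fusion loss; shapes and constants
# are illustrative, not the paper's exact recipe.
import torch

def multiplicative_fusion_loss(probs_per_modality, target, beta=0.5):
    """probs_per_modality: list of (batch, n_classes) softmax outputs, one per modality."""
    n_mod = len(probs_per_modality)
    loss = 0.0
    for m, p_m in enumerate(probs_per_modality):
        p_true = p_m.gather(1, target.unsqueeze(1)).squeeze(1).clamp_min(1e-8)
        # Down-weight a modality's loss when the remaining modalities are already confident.
        others = [probs_per_modality[n].gather(1, target.unsqueeze(1)).squeeze(1)
                  for n in range(n_mod) if n != m]
        weight = torch.ones_like(p_true)
        for p_n in others:
            weight = weight * (1.0 - p_n).clamp_min(1e-8) ** (beta / (n_mod - 1))
        loss = loss + (-weight * torch.log(p_true)).mean()
    return loss

# Toy usage: three modalities, batch of 4, 6 emotion classes.
probs = [torch.softmax(torch.randn(4, 6), dim=1) for _ in range(3)]
print(multiplicative_fusion_loss(probs, torch.randint(0, 6, (4,))))
```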

EmotiCon: Context-Aware Multimodal Emotion Recognition using Frege's Principle

Trisha Mittal, Uttaran Bhattacharya, Rohan Chandra, Aniket Bera, Dinesh Manocha

CVPR 2020

We present EmotiCon, a learning-based algorithm for context-aware perceived human emotion recognition from videos and images. Motivated by Frege's Context Principle from psychology, our approach combines three interpretations of context for emotion recognition. Our first interpretation uses multiple modalities (e.g., faces and gaits) for emotion recognition. For the second interpretation, we gather semantic context from the input image and encode this information with a self-attention-based CNN. Finally, we use depth maps to model the third interpretation, which captures socio-dynamic interactions and proximity among agents. We demonstrate the effectiveness of our network through experiments on EMOTIC, a benchmark dataset. We report an Average Precision (AP) score of 35.48 across 26 classes, an improvement of 7-8 AP points over prior methods. We also introduce a new dataset, GroupWalk, a collection of videos of people walking captured in multiple real-world settings. We report an AP of 65.83 across 4 categories on GroupWalk, which is also an improvement over prior methods.
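
A minimal sketch of the late-fusion structure, assuming precomputed features for the three context interpretations; the layer sizes and head design are placeholders, not EmotiCon's architecture.

```python
# Minimal PyTorch sketch: combine three context streams (multimodal cues, semantic
# context, socio-dynamic/depth context) by late fusion. Layer sizes are placeholders.
import torch
import torch.nn as nn

class ThreeContextFusion(nn.Module):
    def __init__(self, dims=(128, 256, 64), n_classes=26):
        super().__init__()
        # One small head per context interpretation.
        self.heads = nn.ModuleList([nn.Sequential(nn.Linear(d, 64), nn.ReLU()) for d in dims])
        self.classifier = nn.Linear(64 * 3, n_classes)

    def forward(self, modality_feat, semantic_feat, sociodynamic_feat):
        fused = torch.cat([h(x) for h, x in zip(self.heads,
                          (modality_feat, semantic_feat, sociodynamic_feat))], dim=1)
        return self.classifier(fused)  # logits over the 26 emotion classes

model = ThreeContextFusion()
logits = model(torch.randn(2, 128), torch.randn(2, 256), torch.randn(2, 64))
```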

Multimodal and Context-Aware Emotion Perception Model with Multiplicative Fusion

Trisha Mittal, Uttaran Bhattacharya, Rohan Chandra, Aniket Bera, Dinesh Manocha

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 2022

We present a multimodal and context-aware emotion perception model that combines cues from multiple co-occurring modalities (such as face, text, and speech) and is robust to sensor noise in any of the individual modalities. Our approach uses a novel, data-driven multiplicative fusion method to combine the modalities, which learns to emphasize the more reliable cues and suppress the others on a per-sample basis. By introducing a check step that uses Canonical Correlation Analysis to differentiate between ineffective and effective modalities, our model is robust to sensor noise. We also generate proxy features in place of the ineffectual modalities. We demonstrate the effectiveness of our network through experiments on multiple benchmark datasets and report significant improvements over prior work.

Towards Determining Perceived Human Intent for Multimodal Social Media Posts using The Theory of Reasoned Action

Trisha Mittal, Uttaran Bhattacharya, Rohan Chandra, Aniket Bera, Dinesh Manocha

Nature Scientific Reports 2024

Naturalistic Head Motion Generation from Speech

Trisha Mittal, Uttaran Bhattacharya, Rohan Chandra, Aniket Bera, Dinesh Manocha

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 2023

Synthesizing natural head motion to accompany speech for an embodied conversational agent is necessary for providing a rich interactive experience. Most prior works assess the quality of generated head motion by comparing them against a single ground-truth using an objective metric. Yet there are many plausible head motion sequences to accompany a speech utterance. In this work, we study the variation in the perceptual quality of head motions sampled from a generative model. We show that, despite providing more diverse head motions, the generative model produces motions with varying degrees of perceptual quality. We finally show that objective metrics commonly used in previous research do not accurately reflect the perceptual quality of generated head motions. These results open an interesting avenue for future work to investigate better objective metrics that correlate with human perception of quality.

STEP: Spatial Temporal Graph Convolutional Networks for Emotion Perception from Gaits

Trisha Mittal, Uttaran Bhattacharya, Rohan Chandra, Aniket Bera, Dinesh Manocha

AAAI 2017

We present a novel classifier network, STEP, that classifies perceived human emotion from gaits using a Spatial Temporal Graph Convolutional Network (ST-GCN) architecture. Given an RGB video of an individual walking, our formulation implicitly exploits gait features to classify the person's emotional state into one of four emotions: happy, sad, angry, or neutral. We use hundreds of annotated real-world gait videos and augment them with thousands of annotated synthetic gaits generated using a novel generative network, STEP-Gen, built on an ST-GCN-based Conditional Variational Autoencoder (CVAE). We incorporate a novel push-pull regularization loss in the CVAE formulation of STEP-Gen to generate realistic gaits and improve the classification accuracy of STEP. We also release a new dataset, E-Gait, which consists of 2,177 human gaits annotated with perceived emotions along with thousands of synthetic gaits. In practice, STEP learns affective features and achieves a classification accuracy of 89% on E-Gait, which is 14-30% more accurate than prior methods.
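
The push-pull idea can be sketched as a regularizer that pulls synthetic embeddings toward the centroid of real embeddings with the same emotion and pushes them away from the other emotions' centroids. The snippet below is only an illustration of that idea under those assumptions, not STEP-Gen's actual loss; the margin value is a placeholder.

```python
# Illustrative PyTorch sketch of a push-pull style regularizer (not STEP-Gen's exact loss).
import torch

def push_pull_loss(synthetic_emb, synthetic_labels, real_emb, real_labels, margin=1.0):
    losses = []
    for c in real_labels.unique():
        centroid = real_emb[real_labels == c].mean(dim=0)
        same = synthetic_emb[synthetic_labels == c]
        diff = synthetic_emb[synthetic_labels != c]
        if same.numel():
            losses.append(((same - centroid) ** 2).sum(dim=1).mean())                 # pull
        if diff.numel():
            losses.append(torch.relu(margin - (diff - centroid).norm(dim=1)).mean())  # push
    return torch.stack(losses).mean()

# Toy usage with 4 emotion classes and 16-d embeddings.
emb = torch.randn(32, 16)
lbl = torch.randint(0, 4, (32,))
print(push_pull_loss(torch.randn(32, 16), lbl, emb, lbl))
```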

Take an Emotion Walk: Perceiving Emotions from Gaits Using Hierarchical Attention Pooling and Affective Mapping

Trisha Mittal, Uttaran Bhattacharya, Rohan Chandra, Aniket Bera, Dinesh Manocha

CVPR 2019

We present an autoencoder-based semi-supervised approach to classify perceived human emotions from walking styles obtained from videos or motion-captured data and represented as sequences of 3D poses. Given the motion of each joint in the pose at each time step, extracted from the 3D pose sequences, we hierarchically pool these joint motions in a bottom-up manner in the encoder, following the kinematic chains in the human body. We also constrain the latent embeddings of the encoder to contain the space of psychologically motivated affective features underlying the gaits. We train the decoder to reconstruct the motions per joint per time step in a top-down manner from the latent embeddings. For the annotated data, we also train a classifier to map the latent embeddings to emotion labels. Our semi-supervised approach achieves a mean average precision of 0.84 on the Emotion-Gait benchmark dataset, which contains both labeled and unlabeled gaits collected from multiple sources. We outperform current state-of-the-art algorithms for both emotion recognition and action recognition from 3D gaits by 7%-23% on the absolute. More importantly, we improve the average precision by 10%-50% on the absolute on classes that each make up less than 25% of the labeled part of the Emotion-Gait benchmark dataset.
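
A minimal sketch of bottom-up pooling along kinematic chains, with mean pooling standing in for the learned hierarchical attention and placeholder joint indices rather than the paper's skeleton definition.

```python
# Minimal sketch: pool per-joint motion features bottom-up along kinematic chains,
# then across chains and time. Joint indices are placeholders, not the paper's skeleton.
import torch

# Hypothetical kinematic chains over a 16-joint skeleton: two legs, two arms, torso/head.
KINEMATIC_CHAINS = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10], [11, 12, 13], [14, 15]]

def hierarchical_pool(joint_feats):
    """joint_feats: (batch, time, joints, feat) per-joint motion features."""
    chain_feats = [joint_feats[:, :, chain, :].mean(dim=2) for chain in KINEMATIC_CHAINS]
    body_feat = torch.stack(chain_feats, dim=2).mean(dim=2)  # pool chains into a body feature
    return body_feat.mean(dim=1)                             # pool over time -> (batch, feat)

pooled = hierarchical_pool(torch.randn(8, 120, 16, 9))  # e.g. 120 frames, 9-d motion per joint
```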

Generating Emotive Gaits for Virtual Agents Using Affect-Based Autoregression

Trisha Mittal, Uttaran Bhattacharya, Rohan Chandra, Aniket Bera, Dinesh Manocha

AAAI 2018

We present a novel autoregression network to generate virtual agents that convey various emotions through their walking styles or gaits. Given the 3D pose sequence of a gait, our network extracts pertinent movement features and affective features from the gait. We use these features to synthesize subsequent gaits such that the virtual agents can express and transition between emotions represented as combinations of happy, sad, angry, and neutral. We incorporate multiple regularizations in the training of our network to simultaneously enforce plausible movements and noticeable emotions on the virtual agents. We also integrate our approach with an AR environment using a Microsoft HoloLens and can generate emotive gaits at interactive rates to increase social presence. We evaluate, in a web-based study, how human observers perceive both the naturalness and the emotions of the generated gaits. Our results indicate that around 89% of users found the naturalness of the gaits satisfactory on a five-point Likert scale, and the emotions they perceived from the virtual agents were statistically similar to the intended emotions.
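
An autoregressive rollout of this kind can be sketched as follows, assuming a window of past 3D poses and a one-hot emotion label as conditioning; the GRU backbone, pose dimensionality, and rollout length are placeholders, not the paper's network.

```python
# Illustrative sketch of an autoregressive rollout (not the paper's network): predict the
# next 3D pose from past poses plus a target emotion embedding, then feed it back.
import torch
import torch.nn as nn

class NextPosePredictor(nn.Module):
    def __init__(self, pose_dim=48, emotion_dim=4, hidden=128):
        super().__init__()
        self.gru = nn.GRU(pose_dim + emotion_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, pose_dim)

    def forward(self, past_poses, emotion):
        # Condition every timestep on the target emotion (e.g. one-hot happy/sad/angry/neutral).
        emo = emotion.unsqueeze(1).expand(-1, past_poses.size(1), -1)
        h, _ = self.gru(torch.cat([past_poses, emo], dim=-1))
        return self.out(h[:, -1])

def rollout(model, seed_poses, emotion, n_steps=60):
    poses = seed_poses
    for _ in range(n_steps):
        nxt = model(poses, emotion).unsqueeze(1)
        poses = torch.cat([poses, nxt], dim=1)
    return poses

generated = rollout(NextPosePredictor(), torch.randn(1, 30, 48), torch.eye(4)[[0]])
```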

CMetric: A Driving Behavior Measure using Centrality Functions

Rohan Chandra, Uttaran Bhattacharya, Trisha Mittal, Aniket Bera, Dinesh Manocha

IROS 2020

We present a new measure, CMetric, to classify driver behaviors using centrality functions. Our formulation combines concepts from computational graph theory and social traffic psychology to quantify and classify the behavior of human drivers. CMetric computes the probability of a vehicle executing a particular driving style, as well as the intensity with which it executes that style. Our approach is designed for real-time autonomous driving applications, where the trajectory of each vehicle or road-agent is extracted from a video. We compute a dynamic geometric graph (DGG) based on the positions and proximity of the road-agents, along with centrality functions corresponding to closeness and degree. These functions are used to compute CMetric based on style-likelihood and style-intensity estimates. Our approach is general and makes no assumptions about traffic density, heterogeneity, or how driving behaviors change over time.
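
A minimal sketch of the graph-construction and centrality step, assuming 2D road-agent positions at a single timestep and a fixed proximity radius; the mapping from centralities to style likelihood and intensity is not reproduced here.

```python
# Minimal sketch: build a proximity-based dynamic geometric graph over road-agents and
# read off degree and closeness centralities. Radius and positions are illustrative.
import networkx as nx
import numpy as np

def proximity_graph(positions, radius=10.0):
    """positions: (n_agents, 2) array of x, y coordinates at one timestep."""
    g = nx.Graph()
    g.add_nodes_from(range(len(positions)))
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            if np.linalg.norm(positions[i] - positions[j]) <= radius:
                g.add_edge(i, j)
    return g

positions = np.random.default_rng(0).uniform(0, 50, size=(12, 2))
g = proximity_graph(positions)
degree = nx.degree_centrality(g)        # how many agents each vehicle interacts with
closeness = nx.closeness_centrality(g)  # how centrally placed it is in the traffic graph
```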

Forecasting Trajectory and Behavior of Road-Agents Using Spectral Clustering in Graph-LSTMs

Rohan Chandra, Tianrui Guan, Srujan Panuganti, Trisha Mittal, Uttaran Bhattacharya, Aniket Bera, Dinesh Manocha

RAL/IROS 2020

We present a novel approach to predict the trajectories of road-agents in dense traffic scenes using a combination of spectral clustering and deep learning. Our approach is designed for heterogeneous traffic, where different road-agents such as cars, buses, pedestrians, and two-wheelers follow different motion patterns. We model the interactions between road-agents using a weighted dynamic geometric graph (DGG) and compute the spatio-temporal relationships between them using a novel algorithm based on spectral clustering. We use these clusters to classify the behavior of each road-agent and combine them with a long short-term memory (LSTM) network to predict their trajectories. We evaluate our prediction algorithm, Spectral-LSTM, on the TRAF dataset and observe that it outperforms state-of-the-art methods by 30%.
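
The clustering step can be sketched as follows, assuming a Gaussian affinity over inter-agent distances; the LSTM trajectory predictor and the behavior classifier are omitted, and the length scale and cluster count are placeholders.

```python
# Minimal sketch of the spectral-clustering step over road-agent positions.
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_agents(positions, n_clusters=3, length_scale=10.0):
    """positions: (n_agents, 2); returns a cluster label per road-agent."""
    d = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    affinity = np.exp(-(d / length_scale) ** 2)  # closer agents interact more strongly
    return SpectralClustering(n_clusters=n_clusters,
                              affinity="precomputed",
                              random_state=0).fit_predict(affinity)

labels = cluster_agents(np.random.default_rng(0).uniform(0, 100, size=(20, 2)))
```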

GraphRQI: Classifying Driver Behaviors Using Graph Spectrums

Trisha Mittal, Uttaran Bhattacharya, Rohan Chandra, Aniket Bera, Dinesh Manocha

AAAI 2016

We present a novel algorithm, GraphRQI, to identify driver behaviors from road-agent trajectories. Our approach assumes that road-agents exhibit a range of driving traits, such as aggressive or conservative driving. Moreover, these traits affect the trajectories of nearby road-agents as well as the interactions between road-agents. We represent these inter-agent interactions using unweighted and undirected traffic graphs. Our algorithm classifies driver behavior with a supervised learning algorithm by reducing the computation to spectral analysis of the traffic graph. Moreover, we present a novel eigenvalue algorithm to compute the spectrum efficiently, provide theoretical guarantees for its running-time complexity, and show that it is 2 times faster than previous methods. We evaluate the classification accuracy of our approach on traffic videos and autonomous driving datasets corresponding to urban traffic. In practice, GraphRQI achieves an accuracy improvement of up to 25% over prior driver behavior classification algorithms.
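
As an illustration of spectrum-based classification (not the paper's eigenvalue algorithm), the sketch below uses the Laplacian spectrum of an unweighted traffic graph as a feature vector for an off-the-shelf supervised classifier; the graph sizes and labels are synthetic.

```python
# Illustrative sketch: Laplacian spectrum of a traffic graph as features for a classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

def laplacian_spectrum_features(adjacency, k=8):
    """adjacency: (n, n) 0/1 symmetric matrix; returns the k smallest Laplacian eigenvalues."""
    degree = np.diag(adjacency.sum(axis=1))
    eigvals = np.linalg.eigvalsh(degree - adjacency)
    padded = np.zeros(k)
    padded[:min(k, len(eigvals))] = np.sort(eigvals)[:k]
    return padded

# Toy usage: random traffic graphs with synthetic aggressive (1) / conservative (0) labels.
rng = np.random.default_rng(0)
graphs = [(rng.random((10, 10)) < 0.3).astype(float) for _ in range(40)]
graphs = [np.triu(a, 1) + np.triu(a, 1).T for a in graphs]  # symmetric, zero diagonal
X = np.stack([laplacian_spectrum_features(a) for a in graphs])
y = rng.integers(0, 2, size=40)
clf = LogisticRegression(max_iter=1000).fit(X, y)
```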

Emotions Don't Lie: A DeepFake Detection Method using Audio-Visual Affective Cues

Trisha Mittal, Uttaran Bhattacharya, Rohan Chandra, Aniket Bera, Dinesh Manocha

AAAI 2021

Affect2MM: Affective Analysis of Multimedia Content Using Emotion Causality

Trisha Mittal, Uttaran Bhattacharya, Rohan Chandra, Aniket Bera, Dinesh Manocha

CVPR 2021

BOhance: Bayesian Optimization for Content Enhancement

Trisha Mittal, Avinash Paliwal, Divyanshu Aggarwal, Sumit Shekhar

ACM Multimedia 2022

Video Manipulations Beyond Faces: A Dataset with Human-Machine Analysis

Trisha Mittal, Uttaran Bhattacharya, Rohan Chandra, Aniket Bera, Dinesh Manocha

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 2023

Pictionary-style Word Guessing on Hand-drawn Object Sketches: Dataset, Analysis and Deep Network Models

Ravi Kiran Sarvadevabhatla, Shiv Surya, Trisha Mittal, R. Venkatesh Babu

AAAI 2018

S2MGen: A Synthetic Skin Mask Generator for Improving Segmentation

Dolby Laboratories, Inc.

Subhadra Gopalakrishnan, Trisha Mittal, Jaclyn Pytlarz, Yuheng Zhao

Synthetic Data for Computer Vision Workshop@ CVPR 2024
ISM 2024 (Oral)

Skin segmentation is an important and challenging task that finds use in direct applications such as image editing and in indirect downstream tasks such as face detection or hand gesture recognition. However, the availability of diverse and high-quality training data is a major challenge, and annotating dense segmentation masks is an expensive and time-consuming process. Existing skin segmentation datasets are often limited in scope: they are task-specific and captured under controlled conditions, with limited variability in lighting, scale, ethnicity, and age. This lack of diversity in the training data can lead to poor generalization and limited performance on real-world images. To address this issue, we propose a tunable generation pipeline, Synthetic Skin Mask Generator (S2MGen), which allows for the creation of a diverse range of body positions, camera angles, and lighting conditions. We explore the impact of these tunable parameters on skin segmentation performance. We also show that including synthetic data in the training pipeline improves the performance and generalizability of models trained on real-world datasets.
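
A hypothetical configuration sketch of what a tunable generation pipeline's knobs might look like; the parameter names and ranges are placeholders for illustration, not S2MGen's actual interface.

```python
# Hypothetical configuration sketch of a tunable synthetic-data generator; parameter
# names and ranges are placeholders, not S2MGen's actual API.
from dataclasses import dataclass
import random

@dataclass
class GenerationConfig:
    body_pose: str = "standing"        # e.g. standing, sitting, walking
    camera_elevation_deg: float = 0.0  # camera angle relative to the subject
    camera_distance_m: float = 2.0
    light_intensity: float = 1.0       # relative scene illumination
    light_azimuth_deg: float = 0.0
    skin_tone_index: int = 5           # index into a skin-tone palette

def sample_configs(n, seed=0):
    rng = random.Random(seed)
    poses = ["standing", "sitting", "walking", "crouching"]
    return [GenerationConfig(body_pose=rng.choice(poses),
                             camera_elevation_deg=rng.uniform(-30, 60),
                             camera_distance_m=rng.uniform(1.0, 4.0),
                             light_intensity=rng.uniform(0.3, 2.0),
                             light_azimuth_deg=rng.uniform(0, 360),
                             skin_tone_index=rng.randrange(10))
            for _ in range(n)]

configs = sample_configs(1000)  # each config would drive one rendered image + skin mask
```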

3MASSIV: Multilingual, Multimodal and Multi-Aspect dataset of Social Media Short Videos

Trisha Mittal, Uttaran Bhattacharya, Rohan Chandra, Aniket Bera, Dinesh Manocha

AAAI 2023

A Logo-Based Approach for Recognising Multiple Products on a Shelf

Trisha Mittal, B. Laasya, J. Dinesh Babu

SAI Intellisys 2016 (Oral)