To enable real-world applications of personal sensing, such as triggering a mobile intervention or detecting early depression onset, it is important to accurately predict an individual’s mood from passive data with limited, or in some cases zero, labels. In this project, we explore how large pretrained models, commonly called foundation models, may be better suited to this scarce-label setting than standard machine learning baselines.
Over the past several years, foundation models have been developed and tested for vision (MAE), text (BERT/GPT), audio (AudioMAE), tabular (TabPFN), and multi-modal data (M3MAE, CLIP). More recently, foundation models have begun to emerge for time series (PatchTST, TTT, TFC, MOMENT) and wearable data (LSM, MAEEG, MaskFM). Models pretrained on large amounts of data not only outperform models trained from scratch, but in some cases also exceed existing models while using less than 1% of the available labeled data.
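To make the pretraining recipe concrete, below is a minimal PyTorch sketch of the masked-reconstruction objective popularized by MAE, which LSM, MAEEG, and MaskFM follow in broadly similar form for sensor data. All names, dimensions, and the 75% mask ratio are illustrative assumptions on our part, not details of any cited model.

```python
import torch
import torch.nn as nn

class MaskedTSAutoencoder(nn.Module):
    """Minimal masked-reconstruction pretrainer for multivariate time series."""

    def __init__(self, n_channels: int, patch_len: int = 16, d_model: int = 64):
        super().__init__()
        self.patch_len = patch_len
        self.embed = nn.Linear(n_channels * patch_len, d_model)   # patch -> token
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.decode = nn.Linear(d_model, n_channels * patch_len)  # token -> patch
        self.mask_token = nn.Parameter(torch.zeros(d_model))      # learned mask embedding

    def forward(self, x: torch.Tensor, mask_ratio: float = 0.75) -> torch.Tensor:
        # x: (batch, time, channels); cut the series into non-overlapping patches
        b, t, _ = x.shape
        n = t // self.patch_len
        patches = x[:, : n * self.patch_len].reshape(b, n, -1)
        tokens = self.embed(patches)
        # Replace a random subset of tokens with the learned mask token
        mask = torch.rand(b, n, device=x.device) < mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        recon = self.decode(self.encoder(tokens))
        # Reconstruction loss is computed only on the masked patches
        return ((recon - patches) ** 2)[mask].mean()

# Usage: pretrain on unlabeled wearable streams, then reuse `embed` + `encoder`
# as the frozen backbone for downstream mood prediction.
model = MaskedTSAutoencoder(n_channels=6)
loss = model(torch.randn(8, 512, 6))
loss.backward()
```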
While most of these methods have been benchmarked on standard multivariate time-series classification suites (UEA) or on human activity recognition, few if any have been benchmarked on predicting group-level or person-level mood. This task is important not only for its potential to improve remote patient monitoring, but also as a test of the expressive power of these models, given that labels are typically sparse and heterogeneity between individuals is intrinsically high.
We will compare our approach with strong baselines trained from scratch on depression datasets (SENSCODE, GrandChallenge, GLOBEM) and a stress-recognition dataset (SNAPSHOT). To best adapt the pretrained models to our setting, we are also exploring how to fine-tune or post-train them using low-rank matrix approximation and test-time training.
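As a sketch of the low-rank fine-tuning direction, the illustrative PyTorch snippet below wraps a frozen pretrained linear layer with a trainable low-rank update in the style of LoRA; the class name, rank, and scaling factor are our own assumptions, not fixed design choices of this project.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B A x, with only A and B trained."""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pretrained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: update starts as a no-op
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Usage: wrap selected layers of a pretrained encoder, then fine-tune
# only the small set of LoRA parameters on the scarce mood labels.
layer = LoRALinear(nn.Linear(64, 64), r=4)
trainable = [p for p in layer.parameters() if p.requires_grad]  # just A and B
```

Test-time training is complementary: rather than tuning adapter weights on labeled data, it would take a few gradient steps on a self-supervised objective (for example, a masking loss like the one sketched earlier) for each new individual at inference time, which may help with the between-person heterogeneity noted above.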