Guide Labs Team
Guide Labs is a product-focused research company, and our goal is to build a new class of interpretable AI systems that humans and domain experts can reliably understand, steer, and debug.
We have assembled a team with more than 20 years of experience focused on the interpretability and reliability of AI systems, and we have published more than two dozen papers at top machine learning venues. Critically, we have shown that machine learning models trained solely for narrow performance measures, without regard for interpretability, produce explanations that are largely unrelated to the model’s decision-making process and that are not aligned with humans in consequential decisions. Even worse, explanations from such unchecked models can actively mislead. More recently, we have shown that LLM self-explanations, such as chain-of-thought, are unreliable.
These results directly inform our approach to engineering AI models that are interpretable, reliable, and trustworthy. Toward this end, we have demonstrated the effectiveness of rethinking a model’s training process for both language models and protein property prediction. We developed one of the first image-generative models at the billion-parameter scale that is constrained to reliably explain its outputs in terms of human-understandable factors. More recently, we demonstrated that billion-parameter language models can also be trained to be interpretable.
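To make this kind of constraint concrete, here is a minimal, hypothetical sketch of a concept-bottleneck-style model in PyTorch. The concept names, dimensions, and module structure are illustrative placeholders rather than our actual architecture; the point is only that the prediction is computed from a small set of named, human-understandable concept scores that an expert can inspect and edit directly.

```python
# Minimal, hypothetical concept-bottleneck sketch (illustrative only, not our
# production architecture): the prediction must be computed from a small
# vector of named, human-understandable concept scores.
import torch
import torch.nn as nn

CONCEPTS = ["round_shape", "striped_texture", "red_color"]  # illustrative names

class ConceptBottleneckClassifier(nn.Module):
    def __init__(self, input_dim: int, num_classes: int):
        super().__init__()
        # Backbone maps raw inputs to one score per named concept.
        self.concept_head = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, len(CONCEPTS)),
        )
        # The task head sees ONLY the concept scores -- this is the bottleneck.
        self.task_head = nn.Linear(len(CONCEPTS), num_classes)

    def forward(self, x: torch.Tensor):
        concepts = torch.sigmoid(self.concept_head(x))  # each score in [0, 1]
        logits = self.task_head(concepts)
        return logits, concepts  # the concept scores double as the explanation

model = ConceptBottleneckClassifier(input_dim=32, num_classes=5)
logits, concepts = model(torch.randn(4, 32))

# Because predictions depend only on the concept scores, a domain expert can
# inspect them and intervene, e.g. ask "what if the object were not red?"
edited = concepts.clone()
edited[:, CONCEPTS.index("red_color")] = 0.0
edited_logits = model.task_head(edited)
```

Because the task head never sees the raw input, every prediction is attributable to the named concepts, which is what makes this kind of inspection and intervention possible.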
Our past experience has shown that it is crucial to integrate interpretability, safety, and reliability constraints as part of the model development pipeline, and that these constraints can be satisfied without compromising downstream performance. With the new AI systems we are building, we can more easily identify the causes of erroneous outputs, detect when models latch onto spurious signals, and correct the models effectively. We aim to create a world where domain experts shift from merely 'prompting' AI to engaging in meaningful and truthful dialogue with AI systems.
Our team’s work on engineering AI systems to be interpretable and reliable
Here we give a brief overview of a selection of our team’s previous work.
- Concept Bottleneck Generative Models, ICLR 2023.
  - We show, for the first time, how to train billion-parameter generative models that are constrained to explain their outputs in terms of human-understandable factors.
- Concept Bottleneck Language Models for Protein Design, under review, 2024.
  - We demonstrate how to train large language models that are constrained to explain their outputs in terms of human-understandable factors for protein design. For the first time, biochemists and other drug-discovery experts have fine-grained control over a protein language model for antibody and protein design.
- Faithfulness Measurable Masked Language Models, ICML 2024.
  - We present a method for ensuring that the explanations of masked language models are reliable.
- Interpretability Needs a New Paradigm, Position Paper, 2024.
  - We describe our perspective on how to engineer and train models so that they produce truthful and reliable explanations.
- Error Discovery by Clustering Influence Embeddings, NeurIPS 2023.
  - We demonstrate a technique for identifying groups of errors that a model is making.
- Interpretable Mixture of Experts, TMLR 2023.
  - We show how to constrain mixture-of-experts models, a classic machine learning architecture, to be interpretable.
- Improving Deep Learning Interpretability by Saliency Guided Training, NeurIPS 2022.
  - We present an approach for training deep learning models that are constrained to ignore input features with ‘noisy’ gradients. Models that latch onto high-frequency signals or that have noisy gradients produce unreliable explanations (see the sketch after this list).
- Quantifying and Mitigating the Impact of Label Errors on Model Disparity Metrics, ICLR 2023.
  - We present an approach to attribute a model’s bias to its training data.
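Below is a minimal, hypothetical PyTorch sketch of the saliency-guided training idea referenced above. The masking strategy, hyperparameters, and toy model are placeholders rather than the published implementation; the point is that each training step masks the lowest-saliency input features and adds a KL term that penalizes the model if its output distribution changes, discouraging reliance on features with noisy gradients.

```python
# Hypothetical sketch of saliency-guided training (placeholder hyperparameters
# and toy model, not the published implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

def saliency_guided_step(model, optimizer, x, y, mask_frac=0.3, kl_weight=1.0):
    # 1) Input saliency: gradient of the summed logits with respect to x.
    x_grad = x.clone().requires_grad_(True)
    saliency = torch.autograd.grad(model(x_grad).sum(), x_grad)[0].abs()

    # 2) Mask the least-salient features of each example (here: set them to 0).
    k = int(mask_frac * x.shape[1])
    low_idx = saliency.topk(k, dim=1, largest=False).indices
    x_masked = x.clone()
    x_masked.scatter_(1, low_idx, 0.0)

    # 3) Task loss on the original input, plus a KL term that keeps the output
    #    distribution stable when the low-saliency features are removed.
    logits = model(x)
    masked_logits = model(x_masked)
    task_loss = F.cross_entropy(logits, y)
    kl = F.kl_div(F.log_softmax(masked_logits, dim=-1),
                  F.softmax(logits, dim=-1), reduction="batchmean")
    loss = task_loss + kl_weight * kl

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with a toy classifier and random data:
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(8, 20), torch.randint(0, 3, (8,))
saliency_guided_step(model, optimizer, x, y)
```

The intent is that the model learns to rely on features whose saliency is informative rather than noisy, so that its gradient-based explanations become more reliable.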