Building an Intuitive NLP Model for a Trillion Dollar Co.
Natural Language Processing
Problem Statement
Recent state-of-the-art natural language understanding (NLU) systems often behave unpredictably, failing on simpler reasoning examples. Despite this, there has been limited focus on quantifying progress towards systems with more predictable behaviour. The hypothesis of the project was that a behavioural summary of model performance would form the basis for quantifying the predictability of models and help humans form a holistic intuition about them.
Approach
We created a CHECKLIST test suite (184K examples) for Natural Language Inference (NLI), a representative NLU task, and benchmarked state-of-the-art NLI systems on it. The results revealed fine-grained insights into the reasoning abilities of BERT and RoBERTa. We also extended the existing tool by Ribeiro et al. to generate examples for the Knowledge and Implicature categories.
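The idea of generating NLI examples from templates can be illustrated with a minimal sketch. It is written in plain Python and does not use the CHECKLIST tool's API; the function `expand_template`, the template strings, and the filler lexicons are all hypothetical and were made up for this illustration.

```python
from itertools import product


def expand_template(premise_tpl, hypothesis_tpl, label, fillers):
    """Fill every combination of slot values into a premise/hypothesis pair."""
    slots = sorted(fillers)  # fixed slot order so the output is deterministic
    examples = []
    for values in product(*(fillers[slot] for slot in slots)):
        binding = dict(zip(slots, values))
        examples.append({
            "premise": premise_tpl.format(**binding),
            "hypothesis": hypothesis_tpl.format(**binding),
            "label": label,
        })
    return examples


# Hypothetical template and lexicons, invented for illustration only.
fillers = {"name": ["Maria", "Ahmed"], "adj": ["taller", "older"]}
examples = expand_template(
    "{name} is {adj} than John.",
    "John is {adj} than {name}.",
    "contradiction",
    fillers,
)

for example in examples:
    print(example)
```

Each template yields one example per combination of slot values, so a small set of templates and lexicons can grow into a large test suite in which every example carries a known expected label.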
Key challenges
[1] A major challenge was to create a taxonomy of the capabilities that a behavioural assessment should comprise. We surveyed the current literature, drawing on multiple papers, to arrive at a multi-level taxonomy with 17 high-level capabilities.
[2] Another challenge was to understand why models show inter-template variation in accuracy. This involved disentangling different biases in the dataset and understanding their correlation with accuracy. To do this, we created probing examples for selected templates, varying gender, occupation, person names from different countries, and so on.
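One way to look at inter-template variation is to group model outcomes by template and then by the attribute varied in the probing examples. The sketch below assumes each evaluated example is stored as a record with a template identifier, the probed attribute, and a correctness flag. The helper `accuracy_by` and the records are hypothetical; the records are toy values invented for illustration and are not results from the project.

```python
from collections import defaultdict


def accuracy_by(records, *keys):
    """Return accuracy for each distinct combination of the given keys."""
    totals = defaultdict(lambda: [0, 0])  # group -> [num_correct, num_seen]
    for record in records:
        group = tuple(record[key] for key in keys)
        totals[group][0] += int(record["correct"])
        totals[group][1] += 1
    return {group: correct / seen for group, (correct, seen) in totals.items()}


# Toy records invented for illustration; NOT results from the project.
records = [
    {"template": "t1", "gender": "female", "correct": True},
    {"template": "t1", "gender": "female", "correct": False},
    {"template": "t1", "gender": "male", "correct": True},
    {"template": "t1", "gender": "male", "correct": True},
    {"template": "t2", "gender": "female", "correct": True},
    {"template": "t2", "gender": "male", "correct": True},
]

per_template = accuracy_by(records, "template")
per_probe = accuracy_by(records, "template", "gender")

print(per_template)
print(per_probe)
```

Comparing the per-template figures with the per-attribute breakdown shows whether a drop in accuracy on one template is concentrated in a particular filler value, which would point to a bias rather than a failure of the capability itself.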