The Decision module

How we created an interface to enable internal ops teams to operate an AI model

AI-powered quality estimation (QE) is the secret sauce that lets us provide B2B translations at scale. Operating a QE model, however, was a task that often required input from specialists on the AI team.

This project describes how we created an interface on an appropriate product surface so that linguists on the ops team could operate the QE model when required, with confidence and autonomy, and without intervention from the AI teams.

Business objective

Operational efficiency, reduce cost

Project type

Improve unused feature

My role

Sr. Product designer

Skills highlight

Interaction design, AI/ML

Introduction

Unbabel uses a multi-step translation pipeline that transforms text from the source language into text in the target language, producing customer-specific translations at scale and speed (< 20 mins).

The QE model sits between the machine translation (MT) and human post-edition steps. It measures the quality of the MT output and decides if the translation needs to be further edited/improved by a human translator.

The unbabel pipeline

My role

My team was responsible for building the canvas experience and the modules in it, while the AI team built a service wrapping QE configurations for us to consume. I collaborated with 2 product managers on this project, one on my team and one on the AI team, to ensure that the interface aligned with user needs and with the underlying API.

Designing a module for the workflow canvas brought with it a set of specific constraints. I worked through them with the front-end developer on my team while aligning with the designer on the adjacent team who was working on other modules.

Problem definition

To surface QE for the ops team, we needed to map how QE would benefit the customer and then craft an interface that supported the ops team in making informed decisions.

the problem definition

Approach

How QE works

The QE model worked by predicting the quality of every translated sentence (or segment) and associating that predicted quality with a score, i.e. the quality estimate score (QE score).

Aggregating the segment scores into a document-level score gave the model an understanding of the overall quality of the translated text.

Evaluating the document-level score against a threshold then enabled the QE model to decide if the translation needed human edition or if it was good enough to skip it.

how QE works
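The scoring-and-decision logic above can be sketched in a few lines. This is a hypothetical illustration, not Unbabel's actual implementation: the mean aggregation and the threshold value are invented for the example.

```python
# Illustrative sketch of the QE decision logic: aggregate per-segment
# scores into a document-level score, then compare it to a threshold.

def document_score(segment_scores: list[float]) -> float:
    """Aggregate per-segment QE scores into a document-level score.
    A simple mean is used here; the real aggregation may differ."""
    return sum(segment_scores) / len(segment_scores)

def needs_human_edition(segment_scores: list[float], threshold: float) -> bool:
    """The translation skips human post-edition only when the
    document-level score clears the threshold."""
    return document_score(segment_scores) < threshold

# A document whose segments score well clears an (illustrative) 0.7 threshold
print(needs_human_edition([0.9, 0.85, 0.8], threshold=0.7))  # False: skips human edition
print(needs_human_edition([0.5, 0.6, 0.4], threshold=0.7))   # True: goes to a translator
```

Higher scores here mean higher predicted quality, so a document that falls below the threshold is routed to a human translator.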

Introducing QE into the platform

Since QE configurations would impact the overall pipeline’s performance (quality, TAT, price), it felt logical to introduce QE configurations in the context of the entire pipeline.

The QE model actually does two things - it scores the translation and decides if the translation should skip human post-edition (based on that score). Since we were only enabling configurations on the skipping behaviour, I proposed that we visually separate QE into two modules, giving rise to a brand new module that we named the Decision module.

The Decision module would allow the ops team to configure only the skipping behaviour and contain all the skipping-related logic and tradeoffs. This directed the ops team to focus on the only thing they could control and simplified operating QE through a single mental model - its skipping behaviour.

original UI in pipeline builder

Inverting the paradigm - "QE profiles"

Surfacing QE in the decision module resonated well but interacting with QE scores and threshold decimals was not intuitive at all. The primary challenge with numeric QE scores was that the ops team could never anticipate what the outcome would be.

QE profiles were developed as an abstraction over the QE score, based on whether the translation would skip human post-edition instead of on the score itself.

This essentially inverted the paradigm and enabled us to interact with QE models based on the outcome of their skipping behaviour.

Five QE profiles were developed by the AI team based on the skipping behaviour - Very conservative, Conservative, Balanced, Aggressive, Very aggressive. As the names suggest, more aggressive profiles skip more translations.

intro to QE profiles
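The inversion described above can be sketched as a mapping from profile names to thresholds. The profile names come from this case study; the threshold values are invented purely to illustrate the "more aggressive skips more" ordering.

```python
# Hypothetical mapping from QE profiles to skip thresholds.
# The numbers are illustrative only; the ordering is what matters.

QE_PROFILES = {
    "Very conservative": 0.95,  # skips only near-perfect translations
    "Conservative":      0.90,
    "Balanced":          0.85,
    "Aggressive":        0.80,
    "Very aggressive":   0.75,  # skips the most translations
}

def skips_human_edition(doc_score: float, profile: str) -> bool:
    """A translation skips post-edition when its document-level
    score meets the profile's threshold."""
    return doc_score >= QE_PROFILES[profile]

# The same document skips under an aggressive profile but not a conservative one
print(skips_human_edition(0.82, "Aggressive"))         # True
print(skips_human_edition(0.82, "Very conservative"))  # False
```

The ops team never sees the thresholds - they pick a profile by its expected outcome, which is the whole point of the inversion.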

Design iteration 1

QE profiles was a breakthrough in how we talked about QE internally and externally. To rework the decision module with QE profiles instead of QE scores, I realised I would need to start afresh.

Brainstorming with the AI engineers led me to my first breakthrough - to use the historical performance of the model. When the ops team requested the AI team to intervene, their process involved looking at the quality of jobs skipped by the model in the past.

Using this as a starting point, I developed visual concepts where the decision module would allow you to simulate the outcome on a pipeline when the QE profile was switched.

divergent explorations

After much debate and multiple design reviews, the consensus was to:

  1. Line up the QE profiles on a scale to explain linearity

  2. Highlight the volume of translations that will be skipped

  3. Visualise the quality breakdown of the skipped translations

  4. Indicate the impact on the unit cost of translations

In order to avoid working in circles, I decided to test this interface with the ops team in a moderated usability test.

divergent explorations

Testing version 1

The above interface was tested with:

  1. 8 participants

  2. A remote moderated usability test

  3. 2 distinct and independent tasks

  4. Open feedback and a post-mortem at the end

UXR Findings

The results of the test were very positive, with most participants keen to use it for the customers they managed. Most importantly, it enabled me to build a mental model of how the ops team viewed QE and what their priorities were.

usability test results

Design iteration 2

Armed with a higher level of confidence, I proceeded to clean up the interface and further simplify the design based on feedback from the usability tests.

The primary design decisions I took here were:

  1. Juxtapose the skip slider with the volume of jobs being skipped. This is a direct relationship.

  2. Remove TAT as this was not a module level impact.

  3. Remove price as a decision vector, as it was not relevant at the time of development.

final designs 1
final designs 2

Outcomes

After wrapping up the designs, I handed over an instrumentation plan to the front end dev as my final contribution on this project.

Feedback was positive on the 2nd iteration, and after deployment we recorded an uptick in activity in the Decision module.

While we monitored usage and opened channels for feedback to inform future iterations, this project was closed with this deployment as we set about building other parts of the platform!

outcomes

Reflections

Designing for a Human-AI interaction paradigm in 2022 (read: pre-ChatGPT) presented fun new challenges and many opportunities for collaborative head scratching! Over the course of this project, I truly internalised the non-deterministic nature of the system powering the UI and the importance of having well-processed datasets (clean training data, for example).

During the course of UX testing, I also realised that no matter how many guardrails we introduce, the ops team will only gain true confidence to rely on the Decision module for their customer needs after a period of experimentation and constant tweaking. This was undoubtedly just the beginning of this journey.