The Decision module
How we created an interface to enable internal ops teams to operate an AI model
AI-powered quality estimation (QE) is the secret sauce that lets us provide B2B translations at scale. Operating a QE model, however, was a task that often required input from specialists on the AI team.
This case study describes how we created an interface on an appropriate product surface so that linguists on the ops team could operate the QE model, when required, with confidence and autonomy and without intervention from the AI teams.
Introduction
Unbabel uses a multi-step translation pipeline that transforms text in the source language into text in the target language, producing customer-specific translations at scale and speed (< 20 mins).
The QE model sits between the machine translation (MT) and human post-edition steps. It measures the quality of the MT output and decides whether the translation needs to be further edited and improved by a human translator.
My role
My team was responsible for building the canvas experience and the modules in it, while the AI team built a service wrapping QE configurations for us to consume. I collaborated with 2 product managers on this project, one on my team and one on the AI team, to ensure that the interface aligned with user needs and with the underlying API.
Designing a module for the workflow canvas brought with it a set of specific constraints. I worked through them with the front-end developer on my team while aligning with the designer on the adjacent team who was working on other modules.
Problem definition
To surface QE for the ops team, we needed to map how QE would benefit the customer and then craft an interface that supported the ops team to make informed decisions.
Approach
How QE works
The QE model worked by predicting the quality of every translated sentence (or segment) and associating that prediction with a score, i.e. a quality estimate score (QE score).
Aggregating the segment scores into a document-level score gave the model an understanding of the overall quality of the translated text.
Evaluating the document-level score against a threshold then enabled the QE model to decide whether the translation needed human edition or was good enough to skip it.
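The flow above can be sketched in a few lines of code. Everything here is illustrative: the segment scores, the use of a simple mean as the aggregation method, and the threshold value are all assumptions, since the real model's internals were owned by the AI team.

```python
# Illustrative sketch of the QE decision flow described above.
# Segment scores, mean aggregation, and the threshold are hypothetical.

def document_score(segment_scores):
    """Aggregate per-segment QE scores into a document-level score
    (a simple mean is assumed here for illustration)."""
    return sum(segment_scores) / len(segment_scores)

def needs_human_edition(segment_scores, threshold=0.85):
    """Return True if the translation should go to human post-edition,
    i.e. the document-level score falls below the threshold."""
    return document_score(segment_scores) < threshold

# Example: one low-scoring segment drags the document below the threshold
scores = [0.92, 0.88, 0.61]
print(needs_human_edition(scores))  # True -> route to human post-edition
```

The key point the sketch captures is that the decision is binary and threshold-driven, even though the underlying scores are continuous.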
Introducing QE into the platform
Since QE configurations would impact the overall pipeline’s performance (quality, TAT, price), it felt logical to introduce QE configurations in the context of the entire pipeline.
The QE model actually does two things: it scores the translation and, based on that score, decides if the translation should skip human post-edition. Since we were only enabling configurations on the skipping behaviour, I proposed that we visually separate QE into two modules, giving rise to a brand new module that we named the decision module.
The decision module would allow the ops team to configure only the skipping behaviour and would contain all the skipping-related logic and tradeoffs. This directed the ops team to focus on the only thing they could control and simplified operating QE through a single mental model: its skipping behaviour.
Inverting the paradigm - "QE profiles"
Surfacing QE in the decision module resonated well, but interacting with QE scores and threshold decimals was not intuitive at all. The primary challenge with numeric QE scores was that the ops team could never anticipate what the outcome would be.
QE profiles were developed as an abstraction over the QE score: instead of looking at the score itself, a profile was defined by whether translations would skip human post-edition.
This essentially inverted the paradigm and enabled us to interact with QE models based on the outcome of their skipping behaviour.
Five QE profiles were developed by the AI team based on the skipping behaviour: Very conservative, Conservative, Balanced, Aggressive, and Very aggressive. As the names suggest, more aggressive profiles skip more translations.
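One way to picture the profiles is as named presets over the underlying threshold. The mapping below is entirely hypothetical: the actual thresholds were calibrated by the AI team, and these values only illustrate the ordering implied by the profile names.

```python
# Hypothetical mapping from QE profile to skip threshold.
# Real thresholds were calibrated by the AI team; these values only
# illustrate that more aggressive profiles skip more translations.
QE_PROFILES = {
    "very_conservative": 0.95,  # skips only near-perfect translations
    "conservative":      0.90,
    "balanced":          0.85,
    "aggressive":        0.80,
    "very_aggressive":   0.75,  # skips the most translations
}

def skips_human_edition(doc_score, profile):
    """A translation skips human post-edition when its document-level
    score clears the profile's threshold."""
    return doc_score >= QE_PROFILES[profile]

print(skips_human_edition(0.88, "balanced"))      # True
print(skips_human_edition(0.88, "conservative"))  # False
```

This is the inversion described above: the ops team reasons about a named outcome (how much skips) rather than a raw decimal.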
Design iteration 1
QE profiles were a breakthrough in how we talked about QE internally and externally. To rework the decision module with QE profiles instead of QE scores, I realised I would need to start afresh.
Brainstorming with the AI engineers led me to my first breakthrough: to use the historical performance of the model. When the ops team requested their intervention, the AI engineers' process involved looking at the quality of jobs skipped by the model in the past.
Using this as a starting point, I developed visual concepts where the decision module would allow the ops team to simulate the outcome on a pipeline when the QE profile was switched.
After much debate and multiple design reviews, the consensus was to:
Line up the QE profiles on a scale to explain linearity
Highlight the volume of translations that will be skipped
Visualise the quality breakdown of the skipped translations
Indicate the impact on the unit cost of translations
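The simulation concept behind these decisions can be sketched as a replay over historical jobs. The job data, scores, and threshold below are made up for illustration; each job carries its past document-level QE score and a quality label.

```python
# Sketch of simulating a profile switch over historical jobs.
# All data and the threshold are hypothetical illustrations.

HISTORICAL_JOBS = [
    {"doc_score": 0.93, "quality": "good"},
    {"doc_score": 0.87, "quality": "good"},
    {"doc_score": 0.84, "quality": "acceptable"},
    {"doc_score": 0.78, "quality": "poor"},
]

def simulate(jobs, threshold):
    """Return the skip rate and quality breakdown of skipped jobs
    if `threshold` had been applied to this history."""
    skipped = [j for j in jobs if j["doc_score"] >= threshold]
    breakdown = {}
    for job in skipped:
        breakdown[job["quality"]] = breakdown.get(job["quality"], 0) + 1
    return {"skip_rate": len(skipped) / len(jobs), "breakdown": breakdown}

print(simulate(HISTORICAL_JOBS, 0.85))
# {'skip_rate': 0.5, 'breakdown': {'good': 2}}
```

Replaying history like this is what let the interface show skip volume and quality breakdown per profile without making live predictions.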
In order to avoid working in circles, I decided to test this interface with the ops team in a moderated usability test.
Testing version 1
The above interface was tested with:
8 participants
A remote, moderated setup
2 distinct and independent tasks
Open feedback and a post-mortem at the end
UXR Findings
The results of the test were very positive, with most participants keen to use the module for the customers they managed. Most importantly, it enabled me to build a mental model of how the ops team viewed QE and what their priorities were.
Design iteration 2
Armed with a higher level of confidence, I proceeded to clean up the interface and further simplify the design based on feedback from the usability tests.
The primary design decisions I took here were to:
Juxtapose the skip slider with the volume of jobs being skipped, since the two have a direct relationship.
Remove TAT as this was not a module level impact.
Remove price as a decision vector, as it was not relevant at the time of development.
Outcomes
After wrapping up the designs, I handed over an instrumentation plan to the front end dev as my final contribution on this project.
Feedback on the 2nd iteration was positive, and after deployment we recorded an uptick in activity in the decision module.
While we monitored usage and opened feedback channels to inform future iterations, this project was closed with this deployment as we set about building other parts of the platform!
Reflections
Designing for a Human-AI interaction paradigm in 2022 (read: pre-ChatGPT) presented fun new challenges and many opportunities for collaborative head scratching! Over the course of this project, I truly internalised the non-deterministic nature of the system powering the UI and the importance of having well-processed datasets (clean training data, for example).
During the course of UX testing, I also realised that no matter how many guardrails we introduced, only after a period of experimentation and constant tweaking would the ops team gain the true confidence to rely on the decision module for their customers' needs. This was undoubtedly just the beginning of the journey.