ML Test Score Rubric for Production Readiness

The ML Test Score provides a rubric for assessing the production readiness of machine learning systems and reducing their technical debt. It outlines 28 specific tests and monitoring practices drawn from extensive experience with real-world ML systems. The framework gives data scientists and engineers a concrete way to improve the reliability and maintainability of their ML projects and to lower long-term maintenance costs.

Key Points

  • Presents 28 actionable tests for assessing ML system readiness
  • Focuses on reducing technical debt in machine learning projects
  • Offers a scoring system to measure production readiness
  • Guides teams from beginner to advanced ML testing practices
The ML Test Score:
A Rubric for ML Production Readiness and Technical Debt Reduction
Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, D. Sculley
Google, Inc.
{ebreck, cais, nielsene, msalib, dsculley}@google.com
Abstract—Creating reliable, production-level machine learning systems brings on a host of concerns not found in small toy examples or even large offline research experiments. Testing and monitoring are key considerations for ensuring the production-readiness of an ML system, and for reducing technical debt of ML systems. But it can be difficult to formulate specific tests, given that the actual prediction behavior of any given model is difficult to specify a priori. In this paper, we present 28 specific tests and monitoring needs, drawn from experience with a wide range of production ML systems, to help quantify these issues and present an easy-to-follow road-map to improve production readiness and pay down ML technical debt.

Keywords—Machine Learning, Testing, Monitoring, Reliability, Best Practices, Technical Debt
I. INTRODUCTION
As machine learning (ML) systems continue to take on ever more central roles in real-world production settings, the issue of ML reliability has become increasingly critical. ML reliability involves a host of issues not found in small toy examples or even large offline experiments, which can lead to surprisingly large amounts of technical debt [1]. Testing and monitoring are important strategies for improving reliability, reducing technical debt, and lowering long-term maintenance cost. However, as suggested by Figure 1, testing an ML system is a more complex challenge than testing a manually coded system, because ML system behavior depends strongly on data and models that cannot be strongly specified a priori. One way to see this is to consider ML training as analogous to compilation, where the source is both code and training data. By that analogy, training data needs testing like code, and a trained ML model needs production practices like a binary does, such as debuggability, rollbacks and monitoring.

So, what should be tested and how much is enough? In this paper, we try to answer this question with a test rubric, based on decades of experience engineering production-level ML systems at Google, in systems such as ad click prediction [2] and the Sibyl ML platform [3].
We present a rubric as a set of 28 actionable tests, and offer a scoring system to measure how ready for production a given machine learning system is. This rubric is intended to cover a range from a team just starting out with machine learning up through tests that even a well-established team may find difficult. Note that this rubric focuses on issues specific to ML systems, and so does not include generic software engineering best practices such as ensuring good unit test coverage and a well-defined binary release process. Such strategies remain necessary as well. We do call out a few specific areas for unit or integration tests that have unique ML-related behavior.

How to read the tests: Each test is written as an assertion; our recommendation is to test that the assertion is true, the more frequently the better, and to fix the system if the assertion is not true.
Doesn't this all go without saying?: Before we enumerate our suggested tests, we should address one objection the reader may have: obviously one should write tests for an engineering project! While this is true in principle, in a survey of several dozen teams at Google, none of these tests was implemented by more than 80% of teams (though, even in an engineering culture valuing rigorous testing, many of these ML-centric tests are non-obvious). Conversely, most tests had a nonzero score for at least half of the teams surveyed; our tests do represent practices that teams find to be worth doing.

In this paper, we are largely concerned with supervised ML systems that are trained continuously online and perform rapid, low-latency inference on a server. Features are often derived from large amounts of data such as streaming logs of incoming data. However, most of our recommendations apply to other forms of ML systems, such as infrequently trained models pushed to client-side systems for inference.
A. Related work
Software testing is well studied, as is machine learning, but their intersection has been less well explored in the literature. [4] reviews testing for scientific software more generally, and cites a number of articles such as [5], who present an approach for testing ML algorithms. These ideas are a useful complement to the tests we present, which are focused on testing the use of ML in a production system rather than just the correctness of the ML algorithm per se. Zinkevich provides extensive advice on building effective machine learning models in real-world systems [6]. Those rules are complementary to this rubric, which is more concerned with determining how reliable an ML system is rather than how to build one.

Figure 1. ML Systems Require Extensive Testing and Monitoring. The key consideration is that unlike a manually coded system (left), ML-based system behavior is not easily specified in advance. This behavior depends on dynamic qualities of the data, and on various model configuration choices.

Surprising sources of technical debt in ML systems have been studied before [1]. That prior work identified problems but was largely silent on how to address them; this paper details actionable advice drawn from practice and verified with extensive interviews with the maintainers of 36 real-world systems.
II. TESTS FOR FEATURES AND DATA
Machine learning systems differ from traditional software-based systems in that the behavior of ML systems is not specified directly in code but is learned from data. Therefore, while traditional software can rely on unit tests and integration tests of the code, here we attempt to add a sufficient set of tests of the data.
Data 1: Feature expectations are captured in a schema: It is useful to encode intuitions about the data in a schema so they can be automatically checked. For example, an adult human is surely between one and ten feet in height. The most common word in English text is probably 'the', with other word frequencies following a power-law distribution. Such expectations can be used for tests on input data during training and serving (see test Monitor 2).

How? To construct the schema, one approach is to start with calculating statistics from training data, and then adjust them as appropriate based on domain knowledge. It may also be useful to start by writing down expectations and then compare them to the data, to avoid an anchoring bias. Visualization tools such as Facets (https://pair-code.github.io/facets/) can be very useful for analyzing the data to produce the schema. Invariants to capture in a schema can also be inferred automatically from your system's behavior [8].

1 Feature expectations are captured in a schema.
2 All features are beneficial.
3 No feature's cost is too much.
4 Features adhere to meta-level requirements.
5 The data pipeline has appropriate privacy controls.
6 New features can be added quickly.
7 All input feature code is tested.
Table I
BRIEF LISTING OF THE SEVEN DATA TESTS
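For illustration, a minimal hand-rolled schema check might look like the following Python sketch. The feature names and ranges are hypothetical, and a production pipeline would typically rely on dedicated data-validation tooling rather than ad-hoc code like this.

# Sketch of a schema check on input examples. Feature names,
# types, and ranges below are hypothetical illustrations.
SCHEMA = {
    "height_feet": {"type": float, "min": 1.0, "max": 10.0},
    "age_years": {"type": int, "min": 0, "max": 130},
}

def validate_example(example: dict) -> list:
    """Return a list of schema violations for one input example."""
    violations = []
    for name, spec in SCHEMA.items():
        if name not in example:
            violations.append(f"missing feature: {name}")
            continue
        value = example[name]
        if not isinstance(value, spec["type"]):
            violations.append(f"{name}: expected type {spec['type'].__name__}")
        elif not spec["min"] <= value <= spec["max"]:
            violations.append(f"{name}: value {value} outside expected range")
    return violations

# A conforming example produces no violations.
assert validate_example({"height_feet": 5.9, "age_years": 34}) == []

Such checks can then run over samples of both training and serving data, as suggested in test Monitor 2.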
Data 2: All features are beneficial: A kitchen-sink approach to features can be tempting, but every feature added has a software engineering cost. Hence, it's important to understand the value each feature provides in additional predictive power (independent of other features).

How? Some ways to run this test are by computing correlation coefficients, by training models with one or two features, or by training a set of models that each have one of k features individually removed.
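For illustration, the leave-one-out variant could be sketched as below; train_and_score is a hypothetical stand-in for a team's training-and-evaluation pipeline, returning a quality metric such as AUC on held-out data.

# Sketch of a leave-one-out feature ablation. `train_and_score`
# is a hypothetical callable: it trains a model on the given
# feature list and returns its held-out quality metric.
def feature_ablation(features, train_and_score):
    baseline = train_and_score(features)
    report = {}
    for feature in features:
        reduced = [f for f in features if f != feature]
        # Drop in quality attributable to this feature alone.
        report[feature] = baseline - train_and_score(reduced)
    return report

Features whose removal barely moves the metric are candidates for removal under Data 3.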
Data 3: No feature's cost is too much: It is not only a waste of computing resources, but also an ongoing maintenance burden, to include features that add only minimal predictive benefit [1].

How? To measure the costs of a feature, consider not only added inference latency and RAM usage, but also more upstream data dependencies, and additional expected instability incurred by relying on that feature. See Rule #22 in [6] for further discussion.
Data 4: Features adhere to meta-level requirements: Your project may impose requirements on the data coming in to the system. It might prohibit features derived from user data, prohibit the use of specific features like age, or simply prohibit any feature that is deprecated. It might require all features be available from a single source. However, during model development and experimentation, it is typical to try out a wide variety of potential features to improve prediction quality.

How? Programmatically enforce these requirements, so that all models in production properly adhere to them.
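For illustration, such enforcement might be sketched as a validation step that every model must pass before launch; the prohibited names and deprecation prefix are hypothetical.

# Sketch of programmatic enforcement of meta-level feature
# requirements, run as a presubmit or pre-launch validation.
# The prohibited names and deprecation prefix are hypothetical.
PROHIBITED_FEATURES = {"age", "raw_user_query"}
DEPRECATED_PREFIX = "deprecated_"

def check_feature_policy(model_features: set) -> None:
    banned = model_features & PROHIBITED_FEATURES
    deprecated = {f for f in model_features if f.startswith(DEPRECATED_PREFIX)}
    if banned or deprecated:
        raise ValueError(
            f"feature policy violation: prohibited={sorted(banned)}, "
            f"deprecated={sorted(deprecated)}")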
Data 5: The data pipeline has appropriate privacy controls: Training data, validation data, and vocabulary files all have the potential to contain sensitive user data. While teams are often aware of the need to remove personally identifiable information (PII), during this kind of exporting and transformation, programming errors and system changes can lead to inadvertent PII leakages that may have serious consequences.

How? Make sure to budget sufficient time during new feature development that depends on sensitive data to allow for proper handling. Test that access to pipeline data is controlled as tightly as access to raw user data, especially for data sources that haven't previously been used in ML. Finally, test that any user-requested data deletion propagates to the data in the ML training pipeline, and to any learned models.
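For illustration, the deletion-propagation check might be sketched as follows; deleted_user_ids and training_examples are hypothetical accessors for a deletion log and the training data.

# Sketch of a test that user-requested deletions propagate to
# training data. `deleted_user_ids` and `training_examples` are
# hypothetical accessors for the deletion log and the pipeline.
def test_deletion_propagation(deleted_user_ids, training_examples):
    deleted = set(deleted_user_ids())
    for example in training_examples():
        assert example["user_id"] not in deleted, (
            f"training data contains deleted user {example['user_id']}")

A similar check should cover vocabulary files and, where feasible, models learned from the deleted data.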
Data 6: New features can be added quickly: The faster a team can go from a feature idea to the feature running in production, the faster it can both improve the system and respond to external changes. For highly efficient teams, this can be as little as one to two months even for global-scale, high-traffic ML systems. Note that this can be in tension with Data 5, but privacy should always take precedence.

Data 7: All input feature code is tested: Feature creation code may appear simple enough to not need unit tests, but this code is crucial for correct behavior and so its continued quality is vital. Bugs in features may be almost impossible to detect once they have entered the data generation process, especially if they are represented in both training and test data.
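For illustration, a unit test for a hypothetical feature function might look like the sketch below.

import math
import unittest

# Hypothetical feature function: log1p of a view count, capped to
# limit the influence of extreme outliers.
def capped_log_views(view_count: int, cap: int = 10_000) -> float:
    return math.log1p(min(view_count, cap))

class CappedLogViewsTest(unittest.TestCase):
    def test_zero_views_maps_to_zero(self):
        self.assertEqual(capped_log_views(0), 0.0)

    def test_cap_limits_outliers(self):
        self.assertEqual(capped_log_views(10_000_000), math.log1p(10_000))

if __name__ == "__main__":
    unittest.main()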
III. TESTS FOR MODEL DEVELOPMENT
While the field of software engineering has developed a full range of best practices for developing reliable software systems, similar best practices for ML model development are still emerging.

Model 1: Every model specification undergoes a code review and is checked in to a repository: It can be tempting to avoid code review out of expediency, and run experiments based on one's own personal modifications. However, when responding to production incidents, it's crucial to know the exact code that was run to produce a given learned model. For example, a responder might need to re-run training with corrected input data, or compare the result of a particular modification. Proper version control of the model specification can help make training auditable and improve reproducibility.

1 Model specs are reviewed and submitted.
2 Offline and online metrics correlate.
3 All hyperparameters have been tuned.
4 The impact of model staleness is known.
5 A simpler model is not better.
6 Model quality is sufficient on important data slices.
7 The model is tested for considerations of inclusion.
Table II
BRIEF LISTING OF THE SEVEN MODEL TESTS
Model 2: Offline proxy metrics correlate with actual online impact metrics: A user-facing production system's impact is judged by metrics of engagement, user happiness, revenue, and so forth. A machine learning system is trained to optimize loss metrics such as log-loss or squared error. A strong understanding of the relationship between these offline proxy metrics and the actual impact metrics is needed to ensure that a better-scoring model will result in a better production system.

How? The offline/online metric relationship can be measured in one or more small-scale A/B experiments using an intentionally degraded model.
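For illustration, the analysis step of such an experiment might be sketched as below; the paired deltas are hypothetical numbers from a handful of intentionally degraded models.

from statistics import correlation  # Python 3.10+

# Hypothetical paired measurements from degraded-model A/B tests:
# each index pairs an offline metric delta with an online delta.
offline_delta = [-0.020, -0.010, -0.005, -0.001]  # e.g. change in AUC
online_delta = [-0.90, -0.45, -0.20, -0.03]       # e.g. % change in engagement

r = correlation(offline_delta, online_delta)
print(f"offline/online metric correlation: r = {r:.2f}")

A strong, stable correlation supports using the offline metric as a launch criterion; a weak one suggests the proxy is misleading.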
Model 3: All hyperparameters have been tuned: An ML model can often have multiple hyperparameters, such as learning rates, number of layers, layer sizes and regularization coefficients. The choice of hyperparameter values can have a dramatic impact on prediction quality.

How? Methods such as grid search [9] or a more sophisticated hyperparameter search strategy [10], [11] not only improve prediction quality, but can also uncover hidden reliability issues. Substantial performance improvements have been realized in many ML systems through use of an internal hyperparameter tuning service [12], which is closely related to HyperTune [13].
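For illustration, a basic grid search could be sketched with scikit-learn (standing in for the internal service mentioned above); the model, data, and grid values are hypothetical.

# Sketch of a grid search over a single regularization
# hyperparameter, using scikit-learn and synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, random_state=0)
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # inverse regularization
    cv=5,
    scoring="neg_log_loss",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)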
Model 4: The impact of model staleness is known: Many production ML systems encounter rapidly changing, non-stationary data. Examples include content recommendation systems and financial ML applications. For such systems, if the pipeline fails to train and deploy sufficiently up-to-date models, we say the model is stale. Understanding how model staleness affects the quality of predictions is necessary to determine how frequently to update the model. If predictions are based on a model trained yesterday versus last week versus last year, what is the impact on the live metrics of interest? Most models need to be updated eventually to account for changes in the external world; a careful assessment is important to decide how often to perform the updates (see Rule 8 in [6] for related discussion).

How? One way of testing the impact of staleness is with a small A/B experiment with older models. Testing a range of ages can provide an age-versus-quality curve to help understand what amount of staleness is tolerable.
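For illustration, producing that curve might be sketched as follows; load_checkpoint and evaluate are hypothetical stand-ins for the team's model storage and evaluation pipeline.

# Sketch of an age-versus-quality curve: evaluate checkpoints of
# varying age on current data. `load_checkpoint` and `evaluate`
# are hypothetical stand-ins for the surrounding infrastructure.
def staleness_curve(ages_in_days, load_checkpoint, evaluate, eval_data):
    curve = {}
    for age in ages_in_days:
        model = load_checkpoint(days_old=age)
        curve[age] = evaluate(model, eval_data)
    # e.g. {1: 0.93, 7: 0.91, 30: 0.85}: quality versus staleness,
    # used to pick an update cadence.
    return curve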
Model 5: A simpler model is not better: Regularly testing against a very simple baseline model, such as a linear model with very few features, is an effective strategy both for confirming the functionality of the larger pipeline and for helping to assess the cost-to-benefit tradeoffs of more sophisticated techniques.
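For illustration, such a baseline check might be sketched with scikit-learn on synthetic data; in practice the comparison would run against the production model's metric on the same evaluation set.

# Sketch of a baseline comparison: a sophisticated model should
# clearly beat trivial and simple baselines on the same data.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)
baselines = [
    ("majority class", DummyClassifier(strategy="most_frequent")),
    ("linear model", LogisticRegression(max_iter=1000)),
]
for name, model in baselines:
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: AUC = {auc:.3f}")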
Model 6: Model quality is sufficient on all important data slices: Slicing a data set along certain dimensions of interest can improve fine-grained understanding of model quality. Slices should distinguish subsets of the data that might behave qualitatively differently, for example, users by country or movies by genre.
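For illustration, per-slice evaluation might be sketched as below; the field names and slicing key are hypothetical.

from collections import defaultdict

# Sketch of sliced evaluation: compute accuracy per slice rather
# than only in aggregate. `predict` and the field names are
# hypothetical stand-ins for the real model and data format.
def sliced_accuracy(examples, predict, slice_key="country"):
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        s = ex[slice_key]
        total[s] += 1
        correct[s] += int(predict(ex["features"]) == ex["label"])
    return {s: correct[s] / total[s] for s in total}

Alerting when any important slice falls below its quality threshold catches regressions that an aggregate metric can mask.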

FAQs: ML Test Score Rubric for Production Readiness

What is the purpose of the ML Test Score rubric?
The ML Test Score rubric is designed to evaluate the production readiness of machine learning systems and to help teams identify and reduce technical debt. It provides a structured approach to testing and monitoring, ensuring that ML systems are reliable and maintainable over time. By following this rubric, organizations can systematically improve their ML practices and enhance the overall quality of their systems.
How many tests are included in the ML Test Score rubric?
The rubric includes 28 specific tests that cover various aspects of machine learning system readiness. These tests are drawn from real-world experiences and aim to address the unique challenges faced by ML systems in production. The tests help teams evaluate their systems' performance, reliability, and overall effectiveness.
Who can benefit from using the ML Test Score rubric?
Data scientists, machine learning engineers, and organizations developing ML systems can all benefit from the ML Test Score rubric. It serves as a practical guide for teams looking to improve their testing and monitoring practices, ultimately leading to more reliable and maintainable ML systems. The rubric is suitable for both novice and experienced teams, providing a roadmap for enhancing production readiness.
What are some key areas covered by the ML Test Score tests?
The tests in the ML Test Score rubric cover various key areas, including feature validation, model development, infrastructure reliability, and ongoing monitoring. Each area addresses specific challenges and best practices relevant to machine learning systems, ensuring comprehensive evaluation and improvement. By focusing on these critical aspects, teams can better manage their ML projects and reduce potential risks.
How does the scoring system work in the ML Test Score rubric?
The scoring system in the ML Test Score rubric assigns points based on the implementation of the 28 tests. Teams can earn half a point for manually executing a test and a full point for automating it. The final score is determined by taking the minimum score from four sections, emphasizing the importance of addressing all areas of testing and monitoring for comprehensive system readiness.
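As a concrete illustration of that arithmetic, here is a minimal Python sketch; the section names and test statuses are hypothetical.

# Sketch of the scoring described above: 0.5 points for a test
# run manually, 1.0 for an automated test; the final score is
# the minimum over the rubric's sections.
POINTS = {"automated": 1.0, "manual": 0.5, "none": 0.0}

def section_score(test_statuses):
    return sum(POINTS[s] for s in test_statuses)

def ml_test_score(sections):
    # sections: mapping of section name -> list of test statuses
    return min(section_score(tests) for tests in sections.values())

example = {
    "data": ["automated", "manual", "none"],
    "model": ["manual", "manual", "automated"],
}
assert ml_test_score(example) == 1.5  # data section is the bottleneck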
