
TheGuywithTehHat t1_jcrsjlo wrote

Most of that makes sense. The only thing I would be concerned about is the model training test.

Firstly, a unit test should test the smallest possible unit. You should have many unit tests for your model, and each one should be as simple as possible. Nearly every function you write should have its own unit test, and no unit test should test more than one function.

Secondly, there is an important difference between verification and validation testing. Verification testing shouldn't check for any particular accuracy threshold; at most it should verify things like "model.fit() causes the model's weights to change" or "a linear regression model whose weights are all zero produces an output of zero." Verification testing is what you put in your CI pipeline to sanity-check your code before it gets merged to master. Validation testing, by contrast, should test model accuracy. It belongs in your CD pipeline, and it should validate that the model you're about to push to production isn't low quality.
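
For instance, a verification-style test might look something like the sketch below (pytest, with a toy Keras model purely for illustration; `make_model` is just a stand-in for however you actually construct your model):

```python
# Sketch of verification-style tests (pytest). The toy Keras model exists only
# to make the example self-contained; swap in your own framework/model.
import numpy as np
import tensorflow as tf


def make_model():
    model = tf.keras.Sequential([tf.keras.Input(shape=(3,)),
                                 tf.keras.layers.Dense(1)])
    model.compile(optimizer="sgd", loss="mse")
    return model


def test_fit_changes_weights():
    # "model.fit() causes the model to change"
    model = make_model()
    x = np.random.rand(8, 3).astype("float32")
    y = np.random.rand(8, 1).astype("float32")
    before = [w.copy() for w in model.get_weights()]
    model.fit(x, y, epochs=1, verbose=0)
    after = model.get_weights()
    assert any(not np.allclose(b, a) for b, a in zip(before, after))


def test_zero_weights_give_zero_output():
    # "a model whose weights are all zero produces an output of zero"
    model = make_model()
    model.set_weights([np.zeros_like(w) for w in model.get_weights()])
    x = np.random.rand(8, 3).astype("float32")
    assert np.allclose(model.predict(x, verbose=0), 0.0)
```

Tests like these run in seconds on a CPU, which is exactly what you want for a CI sanity check; anything that needs a real training run belongs on the validation/CD side.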


TheGuywithTehHat t1_jcomptw wrote

Any specific part you're wondering about? General advice applies here: test each unit of your software, then integrate the units and test them together. For each unit, hardcode the input and test that the output is what you expect. Keep unit tests as simple as possible while still covering as much of the functionality as possible. For integration tests, write a variety of them, ranging from a couple of combined units with simple input/output all the way up to end-to-end tests that simulate the real world as closely as possible.
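
Roughly like this (pytest; the `normalize` function and the trivial "model" are toy stand-ins just to show the shape of the two kinds of test):

```python
# Sketch of the unit-vs-integration split. Both "units" here are toys that
# exist only to make the example self-contained.
import numpy as np


def normalize(x):
    # Toy unit #1: scale values into [0, 1].
    return (x - x.min()) / (x.max() - x.min())


def predict_mean(x):
    # Toy unit #2: a trivial "model" that predicts the row mean.
    return x.mean(axis=1, keepdims=True)


def test_normalize_unit():
    # Unit test: one function, hardcoded input, exact expected output.
    out = normalize(np.array([0.0, 5.0, 10.0]))
    assert np.allclose(out, [0.0, 0.5, 1.0])


def test_pipeline_integration():
    # Integration test: a couple of units combined, small hardcoded input,
    # checking shapes/invariants rather than exact values.
    x = np.arange(12, dtype=float).reshape(4, 3)
    preds = predict_mean(normalize(x))
    assert preds.shape == (4, 1)
    assert np.all(np.isfinite(preds))
```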

This is all advice that's not specific to ML in any way. Anything more specific depends on so many factors that boil down to:

  1. What is your environment like?
  2. What do you expect to change between different runs of the test?

For example:

  - Will your dataset change? Will it change just a little (MNIST to Fashion-MNIST) or a lot (MNIST to CIFAR)?
  - Will your model change? Will it just be a new training run of the same model, or will the architecture itself change internally? Will the input or output format of the model change?
  - How often will any of these changes happen?
  - Which parts of the pipeline are manual, and which are automatic?
  - For each part of the system, what are the consequences of it failing? Does it merely block further development, or will you get angry calls from your clients?

Edit: I think the best advice I can give is to test everything that can possibly be tested, but prioritize based on risk (chance_of_failure * consequences_of_failure).
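
To make that concrete (the component names and numbers below are entirely made up):

```python
# Toy illustration of prioritizing by risk = chance_of_failure *
# consequences_of_failure.
components = {
    "data loader":      (0.30, 2),   # (chance of failure, consequence 1-10)
    "feature pipeline": (0.20, 6),
    "training loop":    (0.10, 4),
    "serving endpoint": (0.05, 10),
}

by_risk = sorted(components.items(),
                 key=lambda kv: kv[1][0] * kv[1][1],
                 reverse=True)
for name, (p, c) in by_risk:
    print(f"{name:18s} risk = {p * c:.2f}")
```

Spend your testing effort from the top of that list down.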
