
fullstackai t1_jcokcsq wrote

I treat the code artifacts of ML pipelines like any other software and aim for 100% test coverage. Probably a bit controversial, but I always keep a small amount of example data in the repo for unit and integration tests. It could also be downloaded from blob storage in the CI pipeline, but repo size is usually not the limiting factor.
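
Roughly what I mean, as a minimal sketch (the CSV path and the `preprocess` function are placeholders, not a specific project layout):

```python
from pathlib import Path

import pandas as pd
import pytest

# Small, version-controlled example data living next to the tests.
DATA_DIR = Path(__file__).parent / "data"


@pytest.fixture
def example_batch() -> pd.DataFrame:
    # A few representative rows checked into the repo.
    return pd.read_csv(DATA_DIR / "example_batch.csv")


def test_preprocess_keeps_all_rows(example_batch):
    from my_pipeline.features import preprocess  # placeholder import

    result = preprocess(example_batch)
    assert len(result) == len(example_batch)
    assert not result.isna().any().any()
```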

7

-xylon t1_jcokzje wrote

Having a schema and generating random or synthetic data based on that schema is my go-to approach for testing.
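
A minimal sketch of the idea (hand-rolled schema dict here; a dedicated schema library would work just as well):

```python
import numpy as np
import pandas as pd

# Toy schema: column -> dtype and allowed range / values.
SCHEMA = {
    "age": {"dtype": "int", "low": 18, "high": 90},
    "income": {"dtype": "float", "low": 0.0, "high": 250_000.0},
    "segment": {"dtype": "category", "values": ["a", "b", "c"]},
}


def make_random_frame(n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Generate a random DataFrame that conforms to SCHEMA."""
    rng = np.random.default_rng(seed)
    columns = {}
    for name, spec in SCHEMA.items():
        if spec["dtype"] == "int":
            columns[name] = rng.integers(spec["low"], spec["high"], size=n_rows)
        elif spec["dtype"] == "float":
            columns[name] = rng.uniform(spec["low"], spec["high"], size=n_rows)
        else:
            columns[name] = rng.choice(spec["values"], size=n_rows)
    return pd.DataFrame(columns)
```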

8

nucLeaRStarcraft t1_jcoo30z wrote

More or less the same here. However, the simplest way to start, at least in my experience, is to randomize a subsample of real data. Synthetic data may simply be too simple or fail to capture the real distribution, and that can hide bugs.

A combination of both is probably the ideal solution.
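
Something like this is the subsample-and-shuffle idea (column names and the source frame are just placeholders): take a small sample of real data and permute each column independently, which keeps the marginal distributions realistic while breaking the link between values in any one row.

```python
import numpy as np
import pandas as pd


def randomized_subsample(df: pd.DataFrame, n_rows: int = 200, seed: int = 0) -> pd.DataFrame:
    """Return a small sample of df with every column shuffled independently."""
    rng = np.random.default_rng(seed)
    sample = df.sample(n=min(n_rows, len(df)), random_state=seed).reset_index(drop=True)
    # Permute each column separately so no original record survives intact.
    return pd.DataFrame(
        {col: rng.permutation(sample[col].to_numpy()) for col in sample.columns}
    )
```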

3

gdpoc t1_jcperei wrote

It also depends on privacy constraints; sometimes you can't persist the data at all.

5

Fender6969 OP t1_jcrnzzg wrote

Many of the clients I support have rather sensitive data, and persisting it in a repo would be a security risk. I suppose creating synthetic data would be the next best alternative.
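
Something along these lines would be one option, assuming a package like `faker` is acceptable and with the columns below standing in for the real schema:

```python
import pandas as pd
from faker import Faker

fake = Faker()
Faker.seed(42)  # deterministic output for reproducible tests


def synthetic_customers(n_rows: int = 100) -> pd.DataFrame:
    """Generate fully synthetic records; no real values are ever persisted."""
    rows = [
        {
            "customer_id": fake.uuid4(),
            "name": fake.name(),
            "email": fake.email(),
            "signup_date": fake.date_between(start_date="-2y", end_date="today"),
        }
        for _ in range(n_rows)
    ]
    return pd.DataFrame(rows)
```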

1

gamerx88 t1_jctp6px wrote

Is there a reason you feel the need for such rigour? 100% coverage is overkill even for typical software projects, IMO.

You probably end up having to write tests even for simple one-liner functions, which gets exhausting.

1

fullstackai t1_jcttgyd wrote

I should have been more precise: 100% of what goes into any pipeline or deployment gets tested. We deploy many models on the edge in manufacturing; if a model fails, the production line might stand still. We can't risk that.
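
The kind of smoke test I have in mind, as a sketch (ONNX as the edge runtime, the artifact path, and the input shape are all assumptions, not our actual setup):

```python
import numpy as np
import onnxruntime as ort  # assuming ONNX Runtime on the edge device


def test_model_artifact_smoke():
    # Load the exact artifact that would ship to the edge.
    session = ort.InferenceSession("artifacts/model.onnx")  # placeholder path
    input_name = session.get_inputs()[0].name
    batch = np.zeros((1, 3, 224, 224), dtype=np.float32)  # assumed input shape

    (output,) = session.run(None, {input_name: batch})

    # The model must at least produce finite outputs of the expected batch size.
    assert output.shape[0] == 1
    assert np.isfinite(output).all()
```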

1