
fullstackai t1_jcokcsq wrote

I treat the code artifacts of ML pipelines like any other software and aim for 100% test coverage. Probably a bit controversial, but I always keep a small amount of example data in the repo for unit and integration tests. It could also be downloaded from blob storage in the CI pipeline, but repo size is usually not the limiting factor.
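
Roughly what I mean, as a minimal sketch (the CSV path and the `preprocess` function are placeholders, not a specific project layout):

```python
from pathlib import Path

import pandas as pd
import pytest

# Small, version-controlled example data living next to the tests.
DATA_DIR = Path(__file__).parent / "data"


@pytest.fixture
def example_batch() -> pd.DataFrame:
    # A few representative rows checked into the repo.
    return pd.read_csv(DATA_DIR / "example_batch.csv")


def test_preprocess_keeps_all_rows(example_batch):
    from my_pipeline.features import preprocess  # placeholder import

    result = preprocess(example_batch)
    assert len(result) == len(example_batch)
    assert not result.isna().any().any()
```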

7

-xylon t1_jcokzje wrote

Having a schema and generating random or synthetic data based on that schema is my go-to approach for testing.
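
A minimal sketch of the idea (hand-rolled schema dict here; a dedicated schema library would work just as well):

```python
import numpy as np
import pandas as pd

# Toy schema: column -> dtype and allowed range / values.
SCHEMA = {
    "age": {"dtype": "int", "low": 18, "high": 90},
    "income": {"dtype": "float", "low": 0.0, "high": 250_000.0},
    "segment": {"dtype": "category", "values": ["a", "b", "c"]},
}


def make_random_frame(n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Generate a random DataFrame that conforms to SCHEMA."""
    rng = np.random.default_rng(seed)
    columns = {}
    for name, spec in SCHEMA.items():
        if spec["dtype"] == "int":
            columns[name] = rng.integers(spec["low"], spec["high"], size=n_rows)
        elif spec["dtype"] == "float":
            columns[name] = rng.uniform(spec["low"], spec["high"], size=n_rows)
        else:
            columns[name] = rng.choice(spec["values"], size=n_rows)
    return pd.DataFrame(columns)
```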

8

nucLeaRStarcraft t1_jcoo30z wrote

More or less the same here. However, the simplest way to start, at least in my experience, is to randomize a subsample of real data. Synthetic data may simply be too simple or fail to capture the real distribution, and that can hide bugs.

A combination of both is probably the ideal solution.
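
Something like this is the subsample-and-shuffle idea (column names and the source frame are just placeholders): take a small sample of real data and permute each column independently, which keeps the marginal distributions realistic while breaking the link between values in any one row.

```python
import numpy as np
import pandas as pd


def randomized_subsample(df: pd.DataFrame, n_rows: int = 200, seed: int = 0) -> pd.DataFrame:
    """Return a small sample of df with every column shuffled independently."""
    rng = np.random.default_rng(seed)
    sample = df.sample(n=min(n_rows, len(df)), random_state=seed).reset_index(drop=True)
    # Permute each column separately so no original record survives intact.
    return pd.DataFrame(
        {col: rng.permutation(sample[col].to_numpy()) for col in sample.columns}
    )
```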

3

gdpoc t1_jcperei wrote

It also depends on privacy constraints; sometimes you can't persist the data at all.

5

Fender6969 OP t1_jcrnzzg wrote

Many of the clients I support have rather sensitive data, and persisting it in a repo would be a security risk. I suppose creating synthetic data would be the next best alternative.
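
Something along these lines would be one option, assuming a package like `faker` is acceptable and with the columns below standing in for the real schema:

```python
import pandas as pd
from faker import Faker

fake = Faker()
Faker.seed(42)  # deterministic output for reproducible tests


def synthetic_customers(n_rows: int = 100) -> pd.DataFrame:
    """Generate fully synthetic records; no real values are ever persisted."""
    rows = [
        {
            "customer_id": fake.uuid4(),
            "name": fake.name(),
            "email": fake.email(),
            "signup_date": fake.date_between(start_date="-2y", end_date="today"),
        }
        for _ in range(n_rows)
    ]
    return pd.DataFrame(rows)
```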

1

gamerx88 t1_jctp6px wrote

Is there a reason you feel the need for such rigour? 100% coverage is overkill even for typical software projects, IMO.

You probably end up having to write tests even for simple one-liner functions, which gets exhausting.

1

fullstackai t1_jcttgyd wrote

I should have been more precise: 100% of what goes into any pipeline or deployment gets tested. We deploy many models on the edge in manufacturing; if a model fails, the production line might stand still. We can't risk that.
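
The kind of smoke test I have in mind, as a sketch (ONNX as the edge runtime, the artifact path, and the input shape are all assumptions, not our actual setup):

```python
import numpy as np
import onnxruntime as ort  # assuming ONNX Runtime on the edge device


def test_model_artifact_smoke():
    # Load the exact artifact that would ship to the edge.
    session = ort.InferenceSession("artifacts/model.onnx")  # placeholder path
    input_name = session.get_inputs()[0].name
    batch = np.zeros((1, 3, 224, 224), dtype=np.float32)  # assumed input shape

    (output,) = session.run(None, {input_name: batch})

    # The model must at least produce finite outputs of the expected batch size.
    assert output.shape[0] == 1
    assert np.isfinite(output).all()
```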

1