link0007 t1_iz11xzx wrote

It's so strange that the Python ML community still hasn't settled on a suitable model format, despite years and years of effort. What ever happened to efforts like PMML?

Meanwhile, I'm quite happy with the R infrastructure for storing tidymodels pipelines.

26

unofficialmerve OP t1_iz12b00 wrote

I think it's because this was raised only recently and people really didn't know! Since Yannic Kilcher raised it a couple of months ago, François Chollet has announced a format specific to Keras models, and the PyTorch folks are cooking something up for this too. Hopefully this will be solved 🙂

13

ReginaldIII t1_iz1f43w wrote

Tidymodels is a specific example of an R extension package with its own file format. That would be like saying you are quite happy with the Python infrastructure for saving PyTorch models. It's still specific to that one thing.

There are plenty of good ways of storing model weights; those based on HDF5 archives are a great choice, since they are optimized for block tensor operations and on-disk chunking, support lazy slicing, and support nested groups of tensors. Keras uses HDF5 for its save_weights and load_weights functions.
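
For concreteness, here's a minimal sketch of that weights-only HDF5 workflow, assuming a TF 2.x / Keras 2-style API and a toy model; the exact group/dataset names inside the archive vary by version:

```python
import h5py
import tensorflow as tf

def make_model():
    return tf.keras.Sequential([
        tf.keras.Input(shape=(16,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])

model = make_model()

# Weights only -- no architecture code is serialized.
model.save_weights("weights.h5")

# The archive is a plain HDF5 file: nested groups of tensors that can be
# inspected (and lazily sliced) without loading everything into memory.
with h5py.File("weights.h5", "r") as f:
    f.visititems(lambda name, obj: print(name, getattr(obj, "shape", "")))

# Restoring requires the same architecture to already exist in code.
restored = make_model()
restored.load_weights("weights.h5")
```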

If your models are getting huge, you need a different strategy anyway, and this is where S3 object-store-backed systems like TensorStore become more appealing.
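
As a rough sketch of the TensorStore idea (assuming the zarr driver and a local `file` kvstore; for the S3/GCS setup you'd swap in the corresponding kvstore spec, and the path and shapes here are made up):

```python
import numpy as np
import tensorstore as ts

# Open (create) a chunked on-disk array; with an object-store kvstore spec the
# same code would read/write chunks in S3/GCS instead of the local filesystem.
store = ts.open(
    {
        "driver": "zarr",
        "kvstore": {"driver": "file", "path": "/tmp/big_model_weights"},
    },
    dtype=ts.float32,
    shape=[65536, 4096],
    create=True,
    delete_existing=True,
).result()

# Writes only touch the affected chunks.
store[:1024, :].write(np.random.rand(1024, 4096).astype(np.float32)).result()

# Reads are lazy: only the requested slice is fetched.
block = store[512:1024, :128].read().result()
print(block.shape)  # (512, 128)
```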

8

unofficialmerve OP t1_iz1iybe wrote

h5 and TF's SavedModel are the safest options, yet AFAIK you can still inject code through Lambda layers or subclassed models (that's why Keras developed a new format too!). What SavedModel does is reconstruct the architecture and load the weights into it, and that architecture part is essentially code (loading the weights is never the problem for any framework; it's the code part!). So again, you shouldn't blindly deserialize it: the safest code is no code. If you can see the architecture and confirm it doesn't have any custom layers, you should be fine. This is essentially what we do with skops (we audit the model). Alternatively, you can reconstruct the architecture yourself and load the weights into it (there's a sketch after the quotes below), but that's a little tricky: you might have custom objects or, e.g., preprocessing layers for Keras.

>The architecture of subclassed models and layers are defined in the methods __init__ and call. They are considered Python bytecode, which cannot be serialized into a JSON-compatible config -- you could try serializing the bytecode (e.g. via pickle), but it's completely unsafe and means your model cannot be loaded on a different system. (in model subclassing guide)
>
>WARNING: tf.keras.layers.Lambda layers have (de)serialization limitations! (in lambda layers guide)
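
Here's a small sketch of that "audit the config, then load the weights" approach for a Keras model made of built-in layers only (no Lambda or subclassed layers); the file names are hypothetical and it assumes a TF 2.x / Keras 2-style API:

```python
import json
import tensorflow as tf

# Producing side: store the architecture as JSON and the weights as HDF5.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(16,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])
with open("architecture.json", "w") as f:
    f.write(model.to_json())
model.save_weights("weights.h5")

# Consuming side: the JSON config is human-readable, so you can check that it
# only contains built-in layers before turning it back into a model.
with open("architecture.json") as f:
    config = f.read()
print(json.dumps(json.loads(config), indent=2)[:500])

rebuilt = tf.keras.models.model_from_json(config)
rebuilt.load_weights("weights.h5")
```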

Hugging Face also introduced a new format called safetensors, if you're interested: https://github.com/huggingface/safetensors. There's a detailed explanation and comparison in the README.
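
A minimal safetensors sketch, assuming the PyTorch bindings (the tensor and file names are arbitrary):

```python
import torch
from safetensors.torch import load_file, save_file

weights = {
    "embedding.weight": torch.randn(1000, 64),
    "classifier.weight": torch.randn(10, 64),
    "classifier.bias": torch.zeros(10),
}

# Pure tensor data plus a small JSON header -- no pickled Python objects,
# so loading cannot execute arbitrary code.
save_file(weights, "model.safetensors")

restored = load_file("model.safetensors")
print(restored["classifier.weight"].shape)  # torch.Size([10, 64])
```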

6

lmericle t1_iz1kk77 wrote

What about ONNX? Most, if not all, feedforward models can be represented in ONNX.

7

link0007 t1_iz1nhuf wrote

Yes! I knew there was another standard but I couldn't for the life of me remember the name.

Perhaps it's also just a matter of the Python crowd doing a bit more complicated stuff than the R crowd. For me the models tend to be quite straightforward RF or related models (like I said, tidymodels), but the emphasis is much more on getting the pipeline right, with pre- and post-processing. Things become a bit less easy to store once you go into deep neural networks, I'd imagine.

2

unofficialmerve OP t1_iz1nurr wrote

I'm not sure, but I've been told many times that ONNX support for sklearn is sub-par. I haven't researched that one yet. I can ask the maintainers if you're interested.

1

link0007 t1_iz1p01p wrote

I don't use Python, so no need! I just remember being quite confused when I was learning sklearn and realised that saving models or pipelines was weirdly complicated compared to R.

More generally speaking, I suppose the RDS data format is pretty great to work with within R.

2

arsenyinfo t1_iz2hxgi wrote

I've deployed sklearn models via ONNX at two companies, and it works perfectly.
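
For reference, a hedged sketch of that sklearn-to-ONNX route using skl2onnx and onnxruntime; the dataset and file names here are made up:

```python
import numpy as np
import onnxruntime as rt
from skl2onnx import to_onnx
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X = X.astype(np.float32)
clf = RandomForestClassifier(n_estimators=50).fit(X, y)

# Convert; the sample input fixes the expected input type and shape.
onnx_model = to_onnx(clf, X[:1])
with open("rf_iris.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

# Inference with onnxruntime -- no scikit-learn (or pickle) needed at serving time.
sess = rt.InferenceSession("rf_iris.onnx", providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
pred = sess.run(None, {input_name: X[:5]})[0]
print(pred)
```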

1