Submitted by chaotycmunkey t3_11qwzb6 in MachineLearning

Hello!

I am working on comparing several models, a few of which are implemented in PyTorch and the rest in TensorFlow (some in 1.x, others in 2.x). I know that if they are implemented well, one should be able to simply compare their curves/performance regardless of the platform. But there are often subtle differences in the implementations (within the platforms themselves and in the way the model code uses them) that can make it painful to trust the training. Some models are from official sources, so I'd rather not have to verify much of their code before using them. Of course, I don't want to reimplement all of them in a single platform unless I must.

If you have come across this kind of problem, how have you dealt with it? Are there certain tests you would conduct to ensure the loss curves can be compared? How would you go about this other than finding someone else's implementation of, say, a TF model in PyTorch and verifying against it?

Sincerely, A man in crisis.

6

Comments


cthorrez t1_jc5s8ag wrote

Basically, I would just make sure the metrics being compared are computed the same way: same numerator and denominator, e.g., summing vs. averaging, over the batch vs. over the epoch. If the datasets are the same and the type of metric you are computing is the same, it's comparable.

The implementation details just become part of the comparison.
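
To make this concrete, here's a minimal NumPy sketch (the losses and batch sizes are made up) of how a "mean of per-batch means" and a sample-weighted epoch mean can diverge when the last batch is smaller. If two implementations reduce differently, their loss curves won't line up even on identical data:

```python
import numpy as np

# Made-up per-batch mean losses from one epoch; the final batch is smaller.
batch_losses = [0.9, 0.7, 0.5, 0.4]
batch_sizes = [32, 32, 32, 8]

# (a) Mean of per-batch means: every batch weighted equally,
#     so the small final batch is over-weighted.
mean_of_means = np.mean(batch_losses)

# (b) Sample-weighted epoch mean: total loss over total samples.
total_loss = sum(loss * n for loss, n in zip(batch_losses, batch_sizes))
epoch_mean = total_loss / sum(batch_sizes)

print(f"mean of batch means: {mean_of_means:.4f}")  # 0.6250
print(f"epoch mean:          {epoch_mean:.4f}")     # 0.6769
```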

9

sugar_scoot t1_jc5qogu wrote

What's the purpose of your study?

2

Fapaak t1_jc5yuuj wrote

Sounds like a bachelor's thesis to me, at least.

3

chaotycmunkey OP t1_jc6fkn7 wrote

Writing my first paper. I have my own model that I'm going to compare against the SOTA on some RL tasks.

1

BellRock99 t1_jc7fsag wrote

Either trust the implementation, or simply use the metrics reported in their papers on the standard datasets. The latter is more correct in my opinion, since even your implementation could be wrong.

1

chaotycmunkey OP t1_jc7ki7b wrote

My goal is to test their model on a new dataset on which I believe it will fail to perform well, and my proposed model is supposed to be an improvement. As such, I have to run their model myself.

1