Description
Feature Request
Right now, the ONNX Model Zoo consists of a single batch of models accumulated over the years. Some are quite recent and encoded with recent opsets, while others are older and often encoded with early opsets. At this time, I believe users see only one number, e.g. the fraction of models that a given tool supports. However, this number may not directly indicate how up-to-date a given tool is. For example, a tool that supports all of the newer ops but not the older ones may get a similar score to a tool that supports all of the older ops but is missing some of the more recent opsets. Distinguishing these scores would provide value to users.
Ideally, we would continue to modernize the older benchmarks to use the most recent opsets. In addition, I would suggest that we keep the old benchmarks and split them into two categories: one including legacy models with legacy ops (e.g. including deprecated ops and/or attributes/inputs) and one including the recent models.
What is the problem that this feature solves?
By having two sets of benchmarks (one including legacy models and the other recent models), the coverage and applied optimizations become more directly consumable by users, as they will be able to better evaluate the usefulness of a tool for current and legacy applications.
Describe the feature
I would look at the current benchmarks and determine which ones are considered recent and which ones are older. We would discuss which criteria are appropriate to use. Alternatively, we could also bin each of the benchmarks by opset and provide a number per opset, as sketched below.
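To make the binning idea concrete, here is a minimal sketch using the `onnx` Python package. The `models/` directory and the "legacy" opset cutoff are assumptions for illustration, not agreed criteria:

```python
# Sketch: bin Model Zoo benchmarks by the opset version they declare,
# so coverage can be reported per opset rather than as one aggregate number.
from collections import defaultdict
from pathlib import Path

import onnx

MODEL_DIR = Path("models")   # assumed location of the zoo checkout
LEGACY_CUTOFF = 13           # assumed opset below which a model counts as legacy

def main_opset(model: onnx.ModelProto) -> int:
    """Return the opset version declared for the default ONNX domain."""
    for entry in model.opset_import:
        if entry.domain in ("", "ai.onnx"):
            return entry.version
    return 0

bins = defaultdict(list)
for path in MODEL_DIR.rglob("*.onnx"):
    model = onnx.load(str(path), load_external_data=False)
    bins[main_opset(model)].append(path.name)

for opset in sorted(bins):
    label = "legacy" if opset < LEGACY_CUTOFF else "recent"
    print(f"opset {opset:2d} ({label}): {len(bins[opset])} models")
```

Whatever cutoff we agree on, a per-opset summary like this would also give us the two-category split essentially for free.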
We can continue using converters, e.g. to include the up-converted older benchmarks as part of the recent models, to the extent that the up-conversion does not generate graphs that are too distinct from currently generated ones.
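For that up-conversion path, the stock ONNX version converter could be used; a rough sketch, with a hypothetical target opset and file paths:

```python
# Sketch: up-convert a legacy model so it can also be counted among the
# "recent" models. Target opset and paths are placeholders.
import onnx
from onnx import version_converter

TARGET_OPSET = 18  # assumed target; whatever the "recent" set standardizes on

legacy = onnx.load("models/legacy/squeezenet_opset8.onnx")       # hypothetical path
upgraded = version_converter.convert_version(legacy, TARGET_OPSET)

onnx.checker.check_model(upgraded)   # confirm the converted graph is still valid
onnx.save(upgraded, "models/recent/squeezenet_opset18.onnx")
```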
One positive side effect of this effort is that we can also better separate the performance of the converters, whose job is to transform one opset into another, from that of execution tools (runtimes/compilers), whose job is to execute a given model.
Relevant Model Zoo feature areas
Mostly impacts the organization of the benchmarks and performance reporting.
Notes
I know that this issue has been discussed before, and I am looking forward to learning from your past observations.
There is also interest in the community in a slightly separate issue, namely how to convert older models that use uniform precision to newer models that work with (possibly multiple) reduced precisions. Again, having a benchmark set that can better highlight the benefit of a tool on such conversions would be an attractive metric to present to users.
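As a hedged illustration of that precision-conversion scenario (not a proposal for how the zoo should handle it), the `onnxconverter-common` package can cast a uniform-float32 model to float16; the file paths below are hypothetical:

```python
# Illustration only: cast a float32 model to float16 to produce a
# reduced-precision variant of a benchmark.
import onnx
from onnxconverter_common import float16

model_fp32 = onnx.load("models/recent/resnet50_fp32.onnx")   # hypothetical path
model_fp16 = float16.convert_float_to_float16(model_fp32)
onnx.save(model_fp16, "models/recent/resnet50_fp16.onnx")
```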