Description
Background:
The PyMlirContext has a mechanism for creating unique references to MlirOperation
objects. When a PyOperation
object is created, it is interned into a map (together with an owning py::object
) into a map called liveOperations
held by the PyMlirContext
. This allows for various checks (assertions in the C++ code or tests written in Python) to occur to ensure that objects have the appropriate lifetimes. Since MLIR IR has a nested structure, it's possible to create a reference to an Operation in Python and then do something funny (e.g. delete the parent), and then try to use the child op somehow. In C++, printing a freed Operation
will probably crash, but in Python printing a PyOperation
whose underlying Operation*
has been freed should recover gracefully. For reasons like this (I think), the PyOperation retains a valid
member variable which should be set to false
in such situations, and most operations in PyOperation
are guarded by a check to ensure valid == true
.
This seems to work well, but there are several holes in how lifetimes of PyOperations are tracked (and therefore how they are inserted or cleaned up from the liveOperations
map). These are mostly documented in IRCore.cpp
as TODOs.
For example, the following code appears to crash:
def testModuleCleanup():
ctx = Context()
module = Module.parse(r"""module @my_module {
func.func @my_func() {
return
}
}""", ctx)
func = module.body.operations[0]
assert ctx._get_live_module_count() == 1
assert ctx._get_live_operation_count() == 1
module = None
gc.collect()
assert ctx._get_live_module_count() == 0
assert ctx._get_live_operation_count() == 1
print(func)
The reason is that even though module
(PyModule
) was dereferenced and cleaned up, it didn't invalidate its children. Therefore, the PyOperation
pointed to by func
still has its valid bit set (and is still in the liveOperations
map).
Furthermore, even if you don't do anything with func
, it remaining in liveOperations
is problematic. If you re-use the same Context, as in the below code, the MLIR C++ API and the underlying allocator may return a pointer that is still in liveOperations
(since operations may be freed by MLIR but still exist in liveOperations
, that is a valid state):
def testModuleCleanup():
ctx = Context()
module = Module.parse(r"""module @my_module {
func.func @my_func() {
return
}
}""", ctx)
func = module.body.operations[0] # suppose the underlying Operation* == 0x5e89746a58b0
assert ctx._get_live_module_count() == 1
assert ctx._get_live_operation_count() == 1
module = None
gc.collect()
assert ctx._get_live_module_count() == 0
# live operations is still 1, this is expected. In `liveOperations` map,
# PyOperation are tracked by the underlying `Operation*`, which means the key 0x5e89746a58b0
# is still in the map
assert ctx._get_live_operation_count() == 1
# This can result in assertion unpredictably, because MLIR may return new Operation* == 0x5e89746a58b0,
# and the key exists in `liveOperation` map.
module = Module.parse(r"""module @my_module { }""")
An assertion will be encountered because of this line:
https://github.com/llvm/llvm-project/blame/main/mlir/lib/Bindings/Python/IRCore.cpp#L1164
Here you can see that even if we amend the assertion to include a check if it is valid
, proceeding to update the liveOperations
map may invalidate a reference still held by the Python user. IMO this is overly cautious because there are too many ways in which the liveOperations
tracking mechanism can loose track of which operations are valid or not.
In addition, compiling MLIR with CMAKE_BUILD_TYPE=RelWithDebInfo vs CMAKE_BUILD_TYPE=Debug appears to change the likelihood of encountering the assertion in the second example.