[mlir][python] Tracking of PyOperations references in `liveOperations` map breaks down in reasonable scenarios

Background:

The PyMlirContext has a mechanism for creating unique references to `MlirOperation` objects. When a `PyOperation` object is created, it is interned into a map (together with an owning `py::object`) into a map called `liveOperations`  held by the `PyMlirContext`. This allows for various checks (assertions in the C++ code or tests written in Python) to occur to ensure that objects have the appropriate lifetimes. Since MLIR IR has a nested structure, it's possible to create a reference to an Operation in Python and then do something funny (e.g. delete the parent), and then try to use the child op somehow. In C++,  printing a freed `Operation` will probably crash, but in Python printing a `PyOperation` whose underlying `Operation*` has been freed should recover gracefully. For reasons like this (I think), the PyOperation retains a `valid` member variable which should be set to `false` in such situations, and most operations in `PyOperation` are guarded by a check to ensure `valid == true`.

This seems to work well, but there are several holes in how lifetimes of PyOperations are tracked (and therefore how they are inserted or cleaned up from the `liveOperations` map). These are mostly documented in `IRCore.cpp` as TODOs. 

For example, the following code appears to crash:

```
def testModuleCleanup():
    ctx = Context()
    module = Module.parse(r"""module @my_module {
        func.func @my_func() {
            return
        }
    }""", ctx)
    func = module.body.operations[0]
    assert ctx._get_live_module_count() == 1
    assert ctx._get_live_operation_count() == 1
    module = None
    gc.collect()
    assert ctx._get_live_module_count() == 0
    assert ctx._get_live_operation_count() == 1
    print(func)
```

The reason is that even though `module` (`PyModule`) was dereferenced and cleaned up, it didn't invalidate its children. Therefore, the `PyOperation` pointed to by `func` still has its valid bit set (and is still in the `liveOperations` map).

Furthermore, even if you don't do anything with `func`, it remaining in `liveOperations` is problematic. If you re-use the same Context, as in the below code, the MLIR C++ API and the underlying allocator may return a pointer that is still in `liveOperations` (since operations may be freed by MLIR but still exist in `liveOperations`, that is a valid state):

```
def testModuleCleanup():
    ctx = Context()
    module = Module.parse(r"""module @my_module {
        func.func @my_func() {
            return
        }
    }""", ctx)
    func = module.body.operations[0]              # suppose the underlying Operation* == 0x5e89746a58b0
    assert ctx._get_live_module_count() == 1
    assert ctx._get_live_operation_count() == 1
    module = None
    gc.collect()
    assert ctx._get_live_module_count() == 0
   
   # live operations is still 1, this is expected. In `liveOperations` map, 
   # PyOperation are tracked by the underlying `Operation*`, which means the key 0x5e89746a58b0
   # is still in the map
    assert ctx._get_live_operation_count() == 1
   
   #  This can result in assertion unpredictably, because MLIR may return new Operation* == 0x5e89746a58b0, 
   # and the key exists in `liveOperation` map.
   module = Module.parse(r"""module @my_module { }""")
    
```

An assertion will be encountered because of this line:
https://github.com/llvm/llvm-project/blame/main/mlir/lib/Bindings/Python/IRCore.cpp#L1164

Here you can see that even if we amend the assertion to include a check if it is `valid`, proceeding to update the `liveOperations` map may invalidate a reference still held by the Python user. IMO this is overly cautious because there are too many ways in which the `liveOperations` tracking mechanism can loose track of which operations are valid or not.


In addition, compiling MLIR with CMAKE_BUILD_TYPE=RelWithDebInfo vs CMAKE_BUILD_TYPE=Debug appears to change the likelihood of encountering the assertion in the second example.




Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[mlir][python] Tracking of PyOperations references in `liveOperations` map breaks down in reasonable scenarios #92344

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[mlir][python] Tracking of PyOperations references in liveOperations map breaks down in reasonable scenarios #92344

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

[mlir][python] Tracking of PyOperations references in `liveOperations` map breaks down in reasonable scenarios #92344