Creating a StridedMemoryView object currently takes 3.4 to 3.45 μs, depending on whether stream ordering is requested (a stream_ptr of -1 vs. 1):
In [4]: x = cp.empty((23, 4))
In [7]: %timeit s = StridedMemoryView(x, -1)
3.4 μs ± 8.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [8]: %timeit s = StridedMemoryView(x, 1)
3.45 μs ± 14.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
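To reproduce these numbers outside IPython (for example in CI, or to compare before and after an optimization), a small standalone harness can report the mean and standard deviation of the per-call cost the same way %timeit does. This is a minimal sketch: the helper names per_call_us and benchmark_views are hypothetical, and the import path cuda.core.experimental.utils is an assumption that may differ between cuda.core releases.

```python
import statistics
import timeit


def per_call_us(fn, number=100_000, repeat=7):
    """Return (mean, stdev) of the per-call cost of fn in microseconds,
    computed over `repeat` runs of `number` calls each, like %timeit."""
    totals = timeit.Timer(fn).repeat(repeat=repeat, number=number)
    per_loop = [t / number * 1e6 for t in totals]
    return statistics.mean(per_loop), statistics.stdev(per_loop)


def benchmark_views():
    # Assumption: CuPy and cuda.core are installed and a GPU is available.
    # The import path is an assumption and may change between releases.
    import cupy as cp
    from cuda.core.experimental.utils import StridedMemoryView

    x = cp.empty((23, 4))
    for stream_ptr in (-1, 1):
        # The lambda runs inside this iteration, so it sees the current stream_ptr.
        mean, sd = per_call_us(lambda: StridedMemoryView(x, stream_ptr))
        print(f"stream_ptr={stream_ptr}: {mean:.2f} us +/- {sd * 1000:.1f} ns per view")
```

Calling benchmark_views() on a GPU machine prints one line per stream_ptr value, comparable to the %timeit output above. The harness itself does not depend on CUDA, so per_call_us can also time any other candidate construction path.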
This overhead can add up in a tight loop. We should try to bring it down to about 1 μs, or to O(100) ns if possible.
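To make the "tight loop" cost concrete, the arithmetic below compares the total time spent only on view construction at the current cost and at the two targets. The call count of one million is a hypothetical workload chosen for illustration, not a measured one.

```python
def total_overhead_s(cost_us, n_calls):
    """Total seconds spent constructing n_calls views at cost_us microseconds each."""
    return cost_us * n_calls / 1e6


n_calls = 1_000_000  # hypothetical number of views created in a hot loop
for label, cost_us in [("current", 3.4), ("1 us target", 1.0), ("100 ns target", 0.1)]:
    print(f"{label}: {total_overhead_s(cost_us, n_calls):.1f} s")
```

At one million calls this works out to roughly 3.4 s, 1.0 s, and 0.1 s of pure construction overhead, respectively.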
cc @shwina for visibility