Pre-launch workgroupsize auto-tuning

If the caller (host-side code) of a kernel needs to pre-allocate a buffer whose size depends on the workgroupsize, and the workgroupsize is not specified, the caller needs to run workgroupsize auto-tuning before launching the kernel. For example, I used this when implementing the ["mapreduce" kernel in FoldsCUDA.jl](https://github.com/JuliaFolds/FoldsCUDA.jl/blob/f45842d1706c4dde7af73e052757aa878bd4006b/src/kernels.jl#L87-L98). Can we have an API for invoking workgroupsize auto-tuning before launching the kernel?
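To make the use case concrete, here is a minimal sketch of the pattern written directly against CUDA.jl, which is the kind of backend-specific code this request would like to avoid. It assumes CUDA.jl's `@cuda launch=false` and `launch_configuration`; the kernel `block_sum_kernel!` and the helpers `nblocks` and `block_sums` are hypothetical names made up for illustration, not part of any package.

```julia
using CUDA

# Hypothetical toy kernel: every block accumulates into one slot of `partials`,
# so the host has to size `partials` from the block count.
function block_sum_kernel!(partials, xs)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(xs)
        CUDA.@atomic partials[blockIdx().x] += xs[i]
    end
    return nothing
end

# Number of blocks (and therefore buffer slots) needed to cover `n` items.
nblocks(n, threads) = cld(n, threads)

function block_sums(xs::CuArray{T}) where {T}
    n = length(xs)
    n == 0 && return CUDA.zeros(T, 0)

    # Compile without launching. The real buffer cannot be allocated yet,
    # so a zero-length array of the same type stands in for it.
    placeholder = CuArray{T}(undef, 0)
    kernel = @cuda launch=false block_sum_kernel!(placeholder, xs)

    # Ask the occupancy API for a suggested thread count *before* launching.
    config = launch_configuration(kernel.fun)
    threads = max(1, min(n, config.threads))
    blocks = nblocks(n, threads)

    # Only now is the buffer size known.
    partials = CUDA.zeros(T, blocks)
    kernel(partials, xs; threads, blocks)
    return partials
end
```

With something like `xs = CUDA.ones(Float32, 10_000)`, `sum(Array(block_sums(xs)))` should equal `10_000`, since the per-block partial sums add up to the total. The point is the ordering: compile, query the tuned size, allocate, launch. An API exposing the second step would let the same pattern be written without reaching into the backend.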


Pre-launch workgroupsize auto-tuning #216
