Global scan #665

Open
wants to merge 62 commits into
base: master
a34827a
Merge branch 'atomicAdd-fix' into global_scan
kpentaris Jan 29, 2024
d61a2aa
First commit for global scan implementation. Ported most of the code …
kpentaris Jan 30, 2024
0eec0ee
Global scan migration of code to new APIs
kpentaris Feb 11, 2024
2185b4e
Merge branch 'Devsh-Graphics-Programming:master' into global_scan
kpentaris Feb 12, 2024
2df4ad7
First commit for global scan implementation. Ported most of the code …
kpentaris Jan 30, 2024
20a60f5
Global scan migration of code to new APIs
kpentaris Feb 11, 2024
960dbf2
Merge remote
kpentaris Mar 1, 2024
2be8f13
Merge branch 'Devsh-Graphics-Programming:master' into global_scan
kpentaris Mar 11, 2024
6366697
Merge branch 'master' into global_scan
kpentaris Mar 15, 2024
26cb75d
Merge branch 'master' into global_scan
kpentaris Mar 22, 2024
7117cc3
Update CScanner.h to new NBL API
kpentaris Mar 31, 2024
fa979c7
Merge branch 'master' of https://github.com/kpentaris/Nabla into glob…
kpentaris Mar 31, 2024
53e1655
Merge branch 'master' into global_scan
kpentaris Apr 3, 2024
0bcc325
Merge branch 'master' into global_scan
kpentaris Apr 10, 2024
71f4398
Merge branch 'master' into global_scan
kpentaris Apr 14, 2024
51b4f74
Fix CScanner and related shaders to properly compile
kpentaris Apr 21, 2024
4b29049
Merge branch 'master' into global_scan
kpentaris Apr 21, 2024
277db27
Merge branch 'master' into global_scan
kpentaris Apr 27, 2024
2b0c1a2
Merge branch 'master' into global_scan
kpentaris May 1, 2024
056cbaf
Merge branch 'master' into global_scan
kpentaris May 4, 2024
3d29bc3
Update formatting and fix scratch buffer size
kpentaris May 4, 2024
4f9f9ee
Merge branch 'master' into global_scan
kpentaris May 10, 2024
af438bc
Intermediary update of code before re-implementation
kpentaris May 10, 2024
6c45bc6
Merge branch 'master' into global_scan
kpentaris May 19, 2024
61d4806
Initial implementation of global reduce
kpentaris May 20, 2024
44edb59
Fix issues when WG > 1
kpentaris May 20, 2024
14d66a6
Merge branch 'master' into global_scan
kpentaris May 25, 2024
cdcc5d0
Change .gitmodules examples to point to fork
kpentaris Jun 2, 2024
74f6dab
Merge upstream changes
kpentaris Jun 2, 2024
1565fb8
update .gitmodules to also point to branch for examples submodule
kpentaris Jun 2, 2024
e16a195
Merge branch 'master' into global_scan
kpentaris Jun 2, 2024
734e84a
CReduce implementation
kpentaris Jun 9, 2024
40953e3
Fix issues with global reduce algorithm
kpentaris Jun 17, 2024
cc43691
Merge branch 'Devsh-Graphics-Programming:master' into global_scan
kpentaris Jun 19, 2024
3a5d1ff
Merge branch 'master' of https://github.com/kpentaris/Nabla
kpentaris Jun 29, 2024
658ac5b
Merge branch 'master' into global_scan
kpentaris Jun 29, 2024
d4a947d
Merge branch 'global_scan' of https://github.com/kpentaris/Nabla into…
kpentaris Jul 6, 2024
03ae90a
Merge upstream master
kpentaris Jul 6, 2024
3d252d0
refactor global reduce to properly work with required global scan cha…
kpentaris Jul 7, 2024
dd2cc09
Remove unused pseudoLevel parameter
kpentaris Jul 7, 2024
ce78fc0
Fix workgroupFinishFlagsOffset[0] being twice the needed size
kpentaris Jul 7, 2024
f6d7adb
Merge upstream master
kpentaris Jul 21, 2024
0d71686
Revert .gitmodules pointing to examples fork
kpentaris Jul 21, 2024
3ce449d
Merge master
kpentaris Aug 11, 2024
6a0c6c6
Merge master
kpentaris Sep 7, 2024
e467ece
merge master
kpentaris Sep 21, 2024
5dae823
Merge remote branch
kpentaris Sep 21, 2024
79ec513
Merge master to global_scan branch
kpentaris Feb 5, 2025
7f4fcd5
merge upstream Nable to local
kpentaris Feb 25, 2025
462fec5
Merge upstream to local
kpentaris Feb 25, 2025
3f9bdd8
merge upstream master to branch
kpentaris Mar 15, 2025
201636d
update 3rdparty to point to master submodules
kpentaris Mar 15, 2025
21bfa0f
update the hlsl/scan module paths to CMakeLists
kpentaris Mar 16, 2025
0a3728c
Update carithmeticops with proper enum paths
kpentaris Mar 16, 2025
1c32528
merge submodules changes to branch
kpentaris Mar 17, 2025
c9519c3
Merge branch 'master' into global_scan
kpentaris Mar 23, 2025
9f9ae44
Merge upstream master to branch
kpentaris Apr 6, 2025
d21e216
Merge branch 'global_scan' of https://github.com/kpentaris/Nabla into…
kpentaris Apr 6, 2025
a53fb5c
merge upstream master
kpentaris Apr 19, 2025
82bd9a1
Fix issues with ICPUBuffer changes
kpentaris Apr 19, 2025
87ae80f
Merge upstream master
kpentaris Apr 21, 2025
6e740b3
Fix bad shared-mem accessor pattern in direct.hlsl and properly expor…
kpentaris Apr 22, 2025
2 changes: 1 addition & 1 deletion 3rdparty/dxc/dxc
Submodule dxc updated 406 files
4 changes: 4 additions & 0 deletions include/nbl/builtin/hlsl/glsl_compat/core.hlsl
@@ -178,6 +178,10 @@ void memoryBarrierShared() {
spirv::memoryBarrier(spv::ScopeDevice, spv::MemorySemanticsAcquireReleaseMask | spv::MemorySemanticsWorkgroupMemoryMask);
}

void memoryBarrierBuffer() {
spirv::memoryBarrier(spv::ScopeDevice, spv::MemorySemanticsAcquireReleaseMask | spv::MemorySemanticsUniformMemoryMask);
}

namespace impl
{

90 changes: 44 additions & 46 deletions include/nbl/builtin/hlsl/scan/declarations.hlsl
@@ -1,66 +1,64 @@
// Copyright (C) 2023 - DevSH Graphics Programming Sp. z O.O.
// This file is part of the "Nabla Engine".
// For conditions of distribution and use, see copyright notice in nabla.h

#ifndef _NBL_HLSL_SCAN_DECLARATIONS_INCLUDED_
#define _NBL_HLSL_SCAN_DECLARATIONS_INCLUDED_

// REVIEW: Not sure if this file is needed in HLSL implementation

#include "nbl/builtin/hlsl/scan/parameters_struct.hlsl"

#include "nbl/builtin/hlsl/cpp_compat.hlsl"

#ifndef _NBL_HLSL_SCAN_GET_PARAMETERS_DECLARED_
namespace nbl
{
namespace hlsl
{
namespace scan
{
Parameters_t getParameters();
}
}
}
#define _NBL_HLSL_SCAN_GET_PARAMETERS_DECLARED_
#ifndef NBL_BUILTIN_MAX_LEVELS
#define NBL_BUILTIN_MAX_LEVELS 7
Comment on lines +12 to +13

this should be a NBL_CONSTEXPR_STATIC_INLINE in Parameters_t

I keep doing the same reviews
#665 (comment)
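
For illustration only (not part of the PR), a minimal C++ sketch of what the suggestion above could look like. It assumes NBL_CONSTEXPR_STATIC_INLINE behaves like `static constexpr` when compiled as C++, and the member name `MaxLevels` is hypothetical.

```cpp
#include <cstdint>

// Hypothetical stand-in for the PR's Parameters_t: the level limit lives in
// the struct instead of in the NBL_BUILTIN_MAX_LEVELS preprocessor define.
struct Parameters_t
{
    static constexpr uint32_t MaxLevels = 7u; // assumed name, same value as the macro

    uint32_t lastElement[MaxLevels / 2u + 1u];
    uint32_t topLevel;
    uint32_t temporaryStorageOffset[MaxLevels / 2u];
};

static_assert(Parameters_t::MaxLevels / 2u + 1u == 4u, "four lastElement entries");
static_assert(sizeof(Parameters_t) == 8u * sizeof(uint32_t), "eight tightly packed uint32_t values");
```

Scoping the constant to the struct means any code that sees Parameters_t also sees the limit, with no include-order dependency on a macro.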

#endif

#ifndef _NBL_HLSL_SCAN_GET_PADDED_DATA_DECLARED_
namespace nbl
{
namespace hlsl
{
namespace scan
{
template<typename Storage_t>
void getData(
inout Storage_t data,
in uint levelInvocationIndex,
in uint localWorkgroupIndex,
in uint treeLevel,
in uint pseudoLevel
);
}
}
}
#define _NBL_HLSL_SCAN_GET_PADDED_DATA_DECLARED_
#endif
// REVIEW: Putting topLevel second allows better alignment for packing of constant variables, assuming lastElement has length 4. (https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/dx-graphics-hlsl-packing-rules)

we have the -fvk-use-scalar-layout DXC option, which packs everything tightly
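
As a rough illustration of why member order stops mattering under tight packing, here is a C++ sketch (not from the PR). The struct name `ParametersScalar` and the helper `legacyUintArrayBytes` are hypothetical, and the helper is a deliberately simplified model of legacy constant-buffer array packing, not a full implementation of those rules.

```cpp
#include <cstddef>
#include <cstdint>

constexpr uint32_t MaxLevels = 7u; // mirrors NBL_BUILTIN_MAX_LEVELS

// Host-side mirror of Parameters_t. A plain C++ struct of uint32_t members has
// the same tight packing that a scalar block layout gives the shader side.
struct ParametersScalar
{
    uint32_t lastElement[MaxLevels / 2u + 1u];
    uint32_t topLevel;
    uint32_t temporaryStorageOffset[MaxLevels / 2u];
};

// Simplified model (an assumption): under legacy constant-buffer packing every
// uint array element starts a new 16-byte register, and only the last element
// leaves its trailing 12 bytes free for the next member.
constexpr uint32_t legacyUintArrayBytes(uint32_t elementCount)
{
    return elementCount == 0u ? 0u : 16u * (elementCount - 1u) + 4u;
}

static_assert(offsetof(ParametersScalar, topLevel) == 16, "directly after 4 uints");
static_assert(offsetof(ParametersScalar, temporaryStorageOffset) == 20, "no padding inserted");
static_assert(sizeof(ParametersScalar) == 32, "8 uints in total");
```

In this simplified model the four-element lastElement array alone would span 52 bytes under the legacy rules, which is more than the whole struct occupies when packed tightly.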

struct Parameters_t {
uint32_t lastElement[NBL_BUILTIN_MAX_LEVELS/2+1];

add documentation about what lastElement is used for

uint32_t topLevel;
uint32_t temporaryStorageOffset[NBL_BUILTIN_MAX_LEVELS/2];
};

Parameters_t getParameters();

why have a forward declaration for this!?

IMHO you need a constructor of Parameters_t that is able to work all those out from simple elementCount + itemsPerWorkgroup
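
One plausible shape for such a constructor, as a C++ sketch (not from the PR). The struct name `ScanParameters` is hypothetical, and the meaning given to lastElement and temporaryStorageOffset is inferred from how getData and setData index the scratch buffer in this diff, not from any documentation: lastElement[l] is taken to be the index of the last element at level l, and temporaryStorageOffset[l-1] the scratch offset where level l's data starts.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of a Parameters_t that derives itself from two inputs.
struct ScanParameters
{
    static constexpr uint32_t MaxLevels = 7u;

    uint32_t lastElement[MaxLevels / 2u + 1u] = {};
    uint32_t topLevel = 0u;
    uint32_t temporaryStorageOffset[MaxLevels / 2u] = {};

    ScanParameters(uint32_t elementCount, uint32_t itemsPerWorkgroup)
    {
        assert(elementCount > 0u && itemsPerWorkgroup > 1u);
        lastElement[0] = elementCount - 1u;
        // Add a level for as long as the current one needs more than one workgroup.
        while (lastElement[topLevel] >= itemsPerWorkgroup)
        {
            assert(topLevel < MaxLevels / 2u); // input too large for the level limit
            lastElement[topLevel + 1u] = lastElement[topLevel] / itemsPerWorkgroup;
            ++topLevel;
        }
        // Level l (l > 0) keeps lastElement[l] + 1 values in scratch, back to back.
        uint32_t offset = 0u;
        for (uint32_t level = 0u; level < topLevel; ++level)
        {
            temporaryStorageOffset[level] = offset;
            offset += lastElement[level + 1u] + 1u;
        }
    }
};
```

With 100000 elements and 256 items per workgroup this yields three levels (topLevel of 2), with 391 partial results at level 1 followed by 2 at level 2.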


#ifndef _NBL_HLSL_SCAN_SET_DATA_DECLARED_
namespace nbl
{
namespace hlsl
{
namespace scan
{
template<typename Storage_t>
void setData(
in Storage_t data,
in uint levelInvocationIndex,
in uint localWorkgroupIndex,
in uint treeLevel,
in uint pseudoLevel,
in bool inRange
);
struct DefaultSchedulerParameters_t
{
uint32_t cumulativeWorkgroupCount[NBL_BUILTIN_MAX_LEVELS];
uint32_t workgroupFinishFlagsOffset[NBL_BUILTIN_MAX_LEVELS];
uint32_t lastWorkgroupSetCountForLevel[NBL_BUILTIN_MAX_LEVELS];
Comment on lines +33 to +35

not sure you need either cumulativeWorkgroupCount or lastWorkgroupSetCountForLevel for a "last one out closes the door" upsweep
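
To make the "last one out closes the door" idea concrete, here is a CPU-side C++ sketch (not the PR's scheduler). The function name `lastOneOutReduce` is hypothetical, and the "workgroups" run sequentially in reverse order purely to show that the outcome does not depend on arrival order; on a GPU they would run concurrently.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Two-level reduction: each group writes a partial, bumps one counter, and the
// group that observes it was the last to arrive finishes the job.
uint32_t lastOneOutReduce(const std::vector<uint32_t>& input, uint32_t itemsPerGroup)
{
    const uint32_t groupCount = uint32_t((input.size() + itemsPerGroup - 1u) / itemsPerGroup);
    std::vector<uint32_t> partials(groupCount, 0u); // plays the role of the scratch buffer
    std::atomic<uint32_t> finished{0u};             // a single counter, no per-level tables
    uint32_t result = 0u;

    for (uint32_t group = groupCount; group-- > 0u;)
    {
        const std::size_t begin = std::size_t(group) * itemsPerGroup;
        const std::size_t end = std::min(begin + itemsPerGroup, input.size());
        partials[group] = std::accumulate(input.begin() + begin, input.begin() + end, uint32_t(0));

        // Release publishes this group's partial; acquire lets the last
        // arrival observe everyone else's.
        const uint32_t arrivedBefore = finished.fetch_add(1u, std::memory_order_acq_rel);
        if (arrivedBefore == groupCount - 1u)
            result = std::accumulate(partials.begin(), partials.end(), uint32_t(0));
    }
    return result;
}
```

The only scheduling state in this sketch is the counter of finished groups, which is the gist of the comment above.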


};

DefaultSchedulerParameters_t getSchedulerParameters();

template<typename Storage_t, bool isExclusive=false>
void getData(
NBL_REF_ARG(Storage_t) data,
NBL_CONST_REF_ARG(uint32_t) levelInvocationIndex,
NBL_CONST_REF_ARG(uint32_t) localWorkgroupIndex,
NBL_CONST_REF_ARG(uint32_t) treeLevel,
NBL_CONST_REF_ARG(uint32_t) pseudoLevel
);

template<typename Storage_t, bool isScan>
void setData(
NBL_CONST_REF_ARG(Storage_t) data,
NBL_CONST_REF_ARG(uint32_t) levelInvocationIndex,
NBL_CONST_REF_ARG(uint32_t) localWorkgroupIndex,
NBL_CONST_REF_ARG(uint32_t) treeLevel,
NBL_CONST_REF_ARG(uint32_t) pseudoLevel,
NBL_CONST_REF_ARG(bool) inRange
);

}
}
}
#define _NBL_HLSL_SCAN_SET_DATA_DECLARED_
#endif

#endif
370 changes: 173 additions & 197 deletions include/nbl/builtin/hlsl/scan/default_scheduler.hlsl

Large diffs are not rendered by default.

117 changes: 116 additions & 1 deletion include/nbl/builtin/hlsl/scan/descriptors.hlsl
@@ -1,3 +1,118 @@
// Copyright (C) 2023 - DevSH Graphics Programming Sp. z O.O.
// This file is part of the "Nabla Engine".
// For conditions of distribution and use, see copyright notice in nabla.h

#ifndef _NBL_HLSL_SCAN_DESCRIPTORS_INCLUDED_
#define _NBL_HLSL_SCAN_DESCRIPTORS_INCLUDED_

// coherent -> globallycoherent
#include "nbl/builtin/hlsl/scan/declarations.hlsl"
#include "nbl/builtin/hlsl/workgroup/basic.hlsl"

// coherent -> globallycoherent

namespace nbl
{
namespace hlsl
{
namespace scan
{

template<uint32_t dataElementCount=SCRATCH_EL_CNT - NBL_BUILTIN_MAX_LEVELS>
struct Scratch
{
uint32_t reduceResult;
uint32_t workgroupsStarted[NBL_BUILTIN_MAX_LEVELS];
uint32_t data[dataElementCount];
};

[[vk::binding(0 ,0)]] RWStructuredBuffer<Storage_t> scanBuffer; // (REVIEW): Make the type externalizable. Decide how (#define?)

Use Buffer Device Address

alternatively you can use accessor pattern
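
A minimal C++ sketch of the accessor pattern mentioned above (not from the PR). `VectorAccessor` and `inclusiveScan` are hypothetical names; the get/set shape mirrors WGScratchProxy in direct.hlsl, and a serial scan stands in for the real algorithm.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical accessor: the algorithm never names a buffer binding, it only
// calls get/set, so the caller decides where the data actually lives.
struct VectorAccessor
{
    std::vector<uint32_t>& storage;

    void get(const uint32_t ix, uint32_t& value) const { value = storage[ix]; }
    void set(const uint32_t ix, const uint32_t value) { storage[ix] = value; }
};

// Serial inclusive prefix sum written purely against the accessor interface.
template<typename Accessor>
void inclusiveScan(Accessor& accessor, const uint32_t count)
{
    uint32_t running = 0u;
    for (uint32_t i = 0u; i < count; ++i)
    {
        uint32_t value = 0u;
        accessor.get(i, value);
        running += value;
        accessor.set(i, running);
    }
}
```

Because the scan only sees the accessor, swapping a descriptor-bound buffer for a device address becomes a change in the accessor, not in the algorithm.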

[[vk::binding(1 ,0)]] RWStructuredBuffer<Scratch> /*globallycoherent (seems we can't use along with VMM)*/ scanScratchBuf; // (REVIEW): Check if globallycoherent can be used with Vulkan Mem Model

this one you need to use BDA

globallycoherent can't be used with VMM (Vulkan Memory Model), but DXC doesn't support/emit/upgrade to VMM

IIRC you can mark individual loads/stores as coherent even before VMM (just no acquire/release, because that's VMM) with SPIR-V intrinsics, and that should be enough

Also, because the scratch needs to be coherent, it only makes sense for it to come from a buffer, and you might as well use BDA for it

Even without VMM you can use Release/Acquire/SeqCst

The only difference between pre- and post-VMM is that Volatile used to be a Memory Operand, and in VMM it's a Memory Semantic.

https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#_memory_semantics_id
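
For reference, a small C++ sketch (not from the PR) of how the memory semantics masks used by the two barriers in glsl_compat/core.hlsl combine. The numeric values are the ones listed in the linked SPIR-V specification as best recalled here; treat them as illustrative and prefer the official spirv.hpp header in real code. The two helper function names are hypothetical.

```cpp
#include <cstdint>

// Subset of SPIR-V Scope and Memory Semantics values (illustrative copy).
enum Scope : uint32_t { ScopeDevice = 1u, ScopeWorkgroup = 2u };

enum MemorySemanticsMask : uint32_t
{
    MemorySemanticsAcquireMask = 0x2u,
    MemorySemanticsReleaseMask = 0x4u,
    MemorySemanticsAcquireReleaseMask = 0x8u,
    MemorySemanticsUniformMemoryMask = 0x40u,    // buffer (storage) memory
    MemorySemanticsWorkgroupMemoryMask = 0x100u, // groupshared memory
    MemorySemanticsVolatileMask = 0x8000u        // semantics bit under VMM only
};

// Same combination as memoryBarrierShared() in the diff.
constexpr uint32_t sharedBarrierSemantics()
{
    return MemorySemanticsAcquireReleaseMask | MemorySemanticsWorkgroupMemoryMask;
}

// Same combination as the newly added memoryBarrierBuffer().
constexpr uint32_t bufferBarrierSemantics()
{
    return MemorySemanticsAcquireReleaseMask | MemorySemanticsUniformMemoryMask;
}
```

The two barriers differ only in the storage-class bit, which is why the new memoryBarrierBuffer is a near copy of memoryBarrierShared.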


template<typename Storage_t, bool isExclusive>
void getData(
NBL_REF_ARG(Storage_t) data,
NBL_CONST_REF_ARG(uint32_t) levelInvocationIndex,
NBL_CONST_REF_ARG(uint32_t) levelWorkgroupIndex,
NBL_CONST_REF_ARG(uint32_t) treeLevel
)
{
const Parameters_t params = getParameters(); // defined differently for direct and indirect shaders

uint32_t offset = levelInvocationIndex;
const bool notFirstOrLastLevel = bool(treeLevel);
if (notFirstOrLastLevel)
offset += params.temporaryStorageOffset[treeLevel-1u];

//if (pseudoLevel!=treeLevel) // downsweep/scan
//{
// const bool firstInvocationInGroup = workgroup::SubgroupContiguousIndex()==0u;
// if (bool(levelWorkgroupIndex) && firstInvocationInGroup)
// data = scanScratchBuf[0].data[levelWorkgroupIndex+params.temporaryStorageOffset[treeLevel]];
//
// if (notFirstOrLastLevel)
// {
// if (!firstInvocationInGroup)
// data = scanScratchBuf[0].data[offset-1u];
// }
// else
// {
// if(isExclusive)
// {
// if (!firstInvocationInGroup)
// data += scanBuffer[offset-1u];
// }
// else
// {
// data += scanBuffer[offset];
// }
// }
//}
//else
//{
if (notFirstOrLastLevel)
data = scanScratchBuf[0].data[offset];
else
data = scanBuffer[offset];
//}
}

template<typename Storage_t, bool isScan>
void setData(
NBL_CONST_REF_ARG(Storage_t) data,
NBL_CONST_REF_ARG(uint32_t) levelInvocationIndex,
NBL_CONST_REF_ARG(uint32_t) levelWorkgroupIndex,
NBL_CONST_REF_ARG(uint32_t) treeLevel,
NBL_CONST_REF_ARG(bool) inRange
)
{
const Parameters_t params = getParameters();
if (!isScan && treeLevel<params.topLevel) // is reduce and we're not at the last level (i.e. we still save into scratch)
{
const bool lastInvocationInGroup = workgroup::SubgroupContiguousIndex()==(glsl::gl_WorkGroupSize().x-1u);
if (lastInvocationInGroup)
scanScratchBuf[0u].data[levelWorkgroupIndex+params.temporaryStorageOffset[treeLevel]] = data;
}
else if (inRange)
{
if (!isScan && treeLevel == params.topLevel)
{
scanScratchBuf[0u].reduceResult = data;
}
// The following only for isScan == true
else if (bool(treeLevel))
{
const uint32_t offset = params.temporaryStorageOffset[treeLevel-1u];
scanScratchBuf[0].data[levelInvocationIndex+offset] = data;
}
else
{
scanBuffer[levelInvocationIndex] = data;
}
}
}

}
}
}

#endif
100 changes: 71 additions & 29 deletions include/nbl/builtin/hlsl/scan/direct.hlsl
@@ -1,50 +1,92 @@
#ifndef _NBL_HLSL_WORKGROUP_SIZE_
#define _NBL_HLSL_WORKGROUP_SIZE_ 256
#endif
// Copyright (C) 2023 - DevSH Graphics Programming Sp. z O.O.
// This file is part of the "Nabla Engine".
// For conditions of distribution and use, see copyright notice in nabla.h
#pragma shader_stage(compute)

#include "nbl/builtin/hlsl/scan/descriptors.hlsl"
#include "nbl/builtin/hlsl/functional.hlsl"
#include "nbl/builtin/hlsl/glsl_compat/core.hlsl"
#include "nbl/builtin/hlsl/workgroup/scratch_size.hlsl"
#include "nbl/builtin/hlsl/scan/declarations.hlsl"
#include "nbl/builtin/hlsl/scan/virtual_workgroup.hlsl"
#include "nbl/builtin/hlsl/scan/default_scheduler.hlsl"

// ITEMS_PER_WG = WORKGROUP_SIZE
static const uint32_t SharedScratchSz = nbl::hlsl::workgroup::scratch_size_arithmetic<WORKGROUP_SIZE>::value;

// TODO: Can we make it a static variable?
groupshared uint32_t wgScratch[SharedScratchSz];

#include "nbl/builtin/hlsl/workgroup/arithmetic.hlsl"

template<uint16_t offset>
struct WGScratchProxy
{
void get(const uint32_t ix, NBL_REF_ARG(uint32_t) value)
{
value = wgScratch[ix+offset];
}
void set(const uint32_t ix, const uint32_t value)
{
wgScratch[ix+offset] = value;
}

uint32_t atomicAdd(uint32_t ix, uint32_t val)
{
return nbl::hlsl::glsl::atomicAdd(wgScratch[ix + offset], val);
}

void workgroupExecutionAndMemoryBarrier()
{
nbl::hlsl::glsl::barrier();
//nbl::hlsl::glsl::memoryBarrierShared(); implied by the above
}
};
static WGScratchProxy<0> accessor;

// https://github.com/microsoft/DirectXShaderCompiler/issues/6144
uint32_t3 nbl::hlsl::glsl::gl_WorkGroupSize() {return uint32_t3(WORKGROUP_SIZE,1,1);}

struct ScanPushConstants
{
nbl::hlsl::scan::Parameters_t scanParams;
nbl::hlsl::scan::DefaultSchedulerParameters_t schedulerParams;
};

[[vk::push_constant]]
ScanPushConstants spc;
Comment on lines +48 to +55

everything affecting the pipeline layout should be userspace
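
A C++ sketch of the separation this comment asks for (not from the PR). All names here (`Parameters`, `levelsToProcess`, `PushConstantProvider`) are hypothetical: the library side only requires something that can hand it parameters, and the user side decides where those parameters come from.

```cpp
#include <cstdint>

// Library side: knows nothing about push constants, bindings, or layouts.
struct Parameters
{
    uint32_t topLevel;
};

template<typename ParametersProvider>
uint32_t levelsToProcess(const ParametersProvider& provider)
{
    return provider.getParameters().topLevel + 1u;
}

// User side: owns the decision of how parameters reach the shader. In the PR
// that would be the push-constant block; here it is a plain member.
struct PushConstantProvider
{
    Parameters pushed;
    Parameters getParameters() const { return pushed; }
};
```

In shader terms, this would mean the ScanPushConstants declaration moves out of the builtin header and into the user's own shader, which then supplies getParameters.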


/**
* Required since we rely on SubgroupContiguousIndex instead of
* gl_LocalInvocationIndex which means to match the global index
* we can't use the gl_GlobalInvocationID but an index based on
* SubgroupContiguousIndex.
*/
uint32_t globalIndex()
{
return nbl::hlsl::glsl::gl_WorkGroupID().x*WORKGROUP_SIZE+nbl::hlsl::workgroup::SubgroupContiguousIndex();
}

namespace nbl
{
namespace hlsl
{
namespace scan
{
#ifndef _NBL_HLSL_SCAN_PUSH_CONSTANTS_DEFINED_
cbuffer PC // REVIEW: register and packoffset selection
{
Parameters_t scanParams;
DefaultSchedulerParameters_t schedulerParams;
};
#define _NBL_HLSL_SCAN_PUSH_CONSTANTS_DEFINED_
#endif

#ifndef _NBL_HLSL_SCAN_GET_PARAMETERS_DEFINED_
Parameters_t getParameters()
{
-return pc.scanParams;
+return spc.scanParams;
}
#define _NBL_HLSL_SCAN_GET_PARAMETERS_DEFINED_
#endif

#ifndef _NBL_HLSL_SCAN_GET_SCHEDULER_PARAMETERS_DEFINED_
DefaultSchedulerParameters_t getSchedulerParameters()
{
-return pc.schedulerParams;
+return spc.schedulerParams;
}
#define _NBL_HLSL_SCAN_GET_SCHEDULER_PARAMETERS_DEFINED_
#endif

}
}
}

-#ifndef _NBL_HLSL_MAIN_DEFINED_
-[numthreads(_NBL_HLSL_WORKGROUP_SIZE_, 1, 1)]
-void CSMain()
+[numthreads(WORKGROUP_SIZE,1,1)]
+void main()
 {
-nbl::hlsl::scan::main();
-}
-#define _NBL_HLSL_MAIN_DEFINED_
-#endif
+nbl::hlsl::scan::main<BINOP<Storage_t>, Storage_t, IS_SCAN, IS_EXCLUSIVE, uint16_t(WORKGROUP_SIZE), WGScratchProxy<0> >(accessor);
+}