Porting Thrust algorithms to CUB is blocked by the lack of a virtual shared memory (VSMem) abstraction. In Thrust, when a thread block's temporary storage doesn't fit in the default shared memory size, the algorithm switches to per-CTA global memory tiles. Introducing a pointer to shared memory in CUB kernels leads to performance regressions, because generic loads are emitted instead of shared or global ones.
To avoid regressions, we must implement functionality that allows processing user-defined types of any size. It doesn't have to match the agent approach from Thrust or use global memory, as long as these requirements are satisfied.
Tasks
DeviceMergeSort
#549