A case study on our API, friends, free functions #16536
Replies: 3 comments 11 replies
-
The pingpong-ing between the object itself and external entity is ..wow. Personally I prefer API style 2. ComputeKernel& kernel = program.add_compute_kernel(kernel_src, core_range_set, config);
kernel.set_runtime_args(core, {values_buffer->address(), index_buffer->address()}); Because:
However I see some potential problem (with one or both API)
I don't know how much impact the above items have. but I think the following design addresses the issues. However, I don't know the codebase enough to judge correctly. Program program;
// Represents a compiled kernel. Since kernels only depend on the underlying device arch. And that is known at runtime
Kernel kernel("/path/to/your/kernel.cpp", config);
// if we wish to support multiple arch at the same time (i.e. ARCH_NAME is removed as a env variable.
// Kernel kernel = device->create_kernel("/path/to/your/kernel.cpp", config)
// Store runtime info in program not the kernel. So kernel can be reused if needed
program.use_kernel(kernel, cores, {values_buffer->address(), index_buffer->address()});
// If per-core args are needed. Either reuse the kernel where the args are different
for(....) {
program.use_kernel(kernel. core_x_y, {i, ...}); // the easy way
}
// .. or explicitly modify the values
// program.set_args(core_range, {...}); // This is easy to screw up. But fine because it's uncommon |
Beta Was this translation helpful? Give feedback.
-
Hopefully making these changes will help speed up development of the software for future architectures. It would be good to consider how the code should evolve, especially with the next generation of hardware. |
Beta Was this translation helpful? Give feedback.
-
One thing to be mindful of that the current design of
I agree that the kernel and program objects carry too many responsibilities and I think one way to combat this is to tease out as much immutable state up front as possible. Programs/kernels are first built and should become immutable objects. Separately, setting runtime arguments should be a stateful operation that is recorded into the command queue, either via some (cleaned up) flavor of Implementations from other APIsOpenCLhttps://www.khronos.org/files/opencl30-reference-guide.pdf Build immutable program object: cl_program clCreateProgramWithSource (
cl_context context, cl_uint count, const char **strings,
const size_t *lengths, cl_int *errcode_ret)
cl_int clBuildProgram (cl_program program,
cl_uint num_devices, const cl_device_id *device_list,
const char *options, void (CL_CALLBACK*pfn_notify)
(cl_program program, void *user_data),
void *user_data) Create stateful kernel + arguments object that can be enqueued: cl_kernel clCreateKernel (cl_program program,
const char *kernel_name, cl_int *errcode_ret)
cl_int clSetKernelArg (cl_kernel kernel,
cl_uint arg_index, size_t arg_size,
const void *arg_value)
cl_int clEnqueueNDRangeKernel (cl_command_queue command_queue,
cl_kernel kernel,
...) Apple Metal APIBuild immutable program object: id<MTLFunction> addFunction = [defaultLibrary newFunctionWithName:@"add_arrays"];
MTLComputePipelineState addFunctionPSO = [_mDevice newComputePipelineStateWithFunction:addFunction error:&error]; Do stateful argument update: id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];
[computeEncoder setComputePipelineState:_mAddFunctionPSO];
[computeEncoder setBuffer:argBufferA offset:0 atIndex:0];
[computeEncoder setBuffer:argBufferB offset:0 atIndex:1];
[computeEncoder setBuffer:argBufferResult offset:0 atIndex:2]; |
Beta Was this translation helpful? Give feedback.
-
A case study on our API, friends, free functions
This case study is based on the review of TT-NN operations, Program, and Device APIs.
When users introduce a new TT-NN operation, they write a program factory, which creates and configures the program and kernels. Here is an interesting typical piece (topk example):
Here we
Let's look at the code participating in this flow.
CreateKernel
is a handy proxy that decides which kernel to create based on the user-provided Config class:Then we create the actual kernel and add it to the program, not returning a kernel, but returning a handle to it:
This friend calls to pimpl:
For context, this is how Program looks:
Alright, now that we have the kernel added to the program, we need to provide runtime arguments.
But because we don't have kernels at hand, we have to first get it.
Getting it via another friend here:
Looking inside
std::vector<std::unordered_map<KernelHandle, std::shared_ptr<Kernel> >> kernels_;
:Wow.
Can we make it straight?
Here is what this code could be. No extra code participates in the flow.
or
From the first glance it looks easier to maintain, easier to debug and faster.
But it does not make sense to set runtime arguments to the kernel!
We need to look deeper ✨
Why our code looks like this?
Alright, lets get real. Why that code exist? Look at some of our classes:
Device
has ~130 public methods, about 60-80 are used externally, the rest are only used (or intended for use only) by other metal runtime entities like DevicePool.Kernel
has ~50, only a couple are supposed for external use, the rest is there for the sytemProgram
has ~50, same as above.We don't like this massive API surface, it is hard to maintain and hard to change. It exposes things which we think of as internal to the metal layer. We want that fixed. So we use
free functions
,friends
andopaque handle
to draw some lines between APIs that serve needs of internal (within Metal) and external (above Metal, e.g. TT-NN) consumers. And we expect that in the limit this allows to completely hide the mess and give users a tidy API likeSetRuntimeArgs(program, binary_writer_kernel_id, core, args)
.As anything, this has a cost. It is not all white or black, just some things to keep in mind:
So far we were ready to pay this cost.
Today I look at our code and ask - is it worth it?
The Issue
The problem is that we have inadequate abstractions. To protect users we reframe it to a problem of "internal vs external" access. However, this approach has proven to be disproportionately costly. Not only are we failing to adequately shield users, but we're also not making meaningful progress in addressing the core issue.
Take a look at a typical pipeline:
Look at the Kernel class:
All these methods are a part of a Kernel today.
Same with Device and Program. We mix multiple responsibilities into a single class and now language/build tools do not suffice. We have options that provide different dynamics, both have a high cost.
I propose us to deeper explore the option 2. If we treat a class as an atomic element in the design, accepting that its API can't be split into internal/external, we can avoid unnecessary complexity. This principle should guide ✨ the system's design. Excellent example in the comments
What's next?
Beta Was this translation helpful? Give feedback.
All reactions