A case study on our API, friends, free functions #16536

ayerofieiev-tt · 2025-01-08T18:53:36Z

ayerofieiev-tt
Jan 8, 2025
Maintainer

A case study on our API, friends, free functions

This case study is based on the review of TT-NN operations, Program, and Device APIs.

When users introduce a new TT-NN operation, they write a program factory, which creates and configures the program and kernels. Here is an interesting typical piece (topk example):

tt::tt_metal::Program program{};
...
tt::tt_metal::KernelHandle binary_writer_kernel_id = tt::tt_metal::CreateKernel(
    program,
    "ttnn/cpp/ttnn/operations/reduction/topk/device/kernels/dataflow/writer_binary_interleaved.cpp",
    core,
    tt::tt_metal::WriterDataMovementConfig(writer_compile_time_args));
SetRuntimeArgs(program, binary_writer_kernel_id, core, {values_buffer->address(), index_buffer->address()});

Here we

Create a program
Create a kernel and add it to the program
Set kernel runtime arguments

Let's look at the code participating in this flow.

CreateKernel is a handy proxy that decides which kernel to create based on the user-provided Config class:

KernelHandle CreateKernel(
    Program& program,
    const std::string& file_name,
    const std::variant<CoreCoord, CoreRange, CoreRangeSet>& core_spec,
    const std::variant<DataMovementConfig, ComputeConfig, EthernetConfig>& config) {
    return std::visit(
        [&](auto&& cfg) -> KernelHandle {
            CoreRangeSet core_ranges = GetCoreRangeSet(core_spec);
            KernelSource kernel_src(file_name, KernelSource::FILE_PATH);
            using T = std::decay_t<decltype(cfg)>;
            if constexpr (std::is_same_v<T, DataMovementConfig>) {
                return CreateDataMovementKernel(program, kernel_src, core_ranges, cfg);
            } else if constexpr (std::is_same_v<T, ComputeConfig>) {
                return CreateComputeKernel(program, kernel_src, core_ranges, cfg);
            } else if constexpr (std::is_same_v<T, EthernetConfig>) {
                return CreateEthernetKernel(program, kernel_src, core_ranges, cfg);
            }
        },
        config);
}

Then we create the actual kernel and add it to the program, not returning a kernel, but returning a handle to it:

KernelHandle CreateComputeKernel(
    Program& program, const KernelSource& kernel_src, const CoreRangeSet& core_range_set, const ComputeConfig& config) {
    std::shared_ptr<Kernel> kernel = std::make_shared<ComputeKernel>(kernel_src, core_range_set, config);
    return detail::AddKernel(program, kernel, HalProgrammableCoreType::TENSIX);
}

This friend function forwards to the pimpl:

KernelHandle AddKernel (Program &program, const std::shared_ptr<Kernel>& kernel, const HalProgrammableCoreType core_type) {
    return program.pimpl_->add_kernel(std::move(kernel), core_type);
}

For context, this is how Program looks:

Program {
public:
    ...
    std::shared_ptr<Kernel> get_kernel(KernelHandle kernel_id) const;
    ...

private:
    ..
    friend KernelHandle detail::AddKernel(Program &program, const std::shared_ptr<Kernel>& kernel, const HalProgrammableCoreType core_type);
    friend std::shared_ptr<Kernel> detail::GetKernel(const Program &program, KernelHandle kernel_id);

}

Alright, now that we have the kernel added to the program, we need to provide runtime arguments.
But because we don't have the kernel at hand, we first have to get it.

void SetRuntimeArgs(
    ProgramHandle &program, KernelHandle kernel, const CoreRangeSet &core_spec, RuntimeArgs runtime_args) {
    if (runtime_args.empty()) {
        return;
    }

    const auto kernel_ptr = detail::GetKernel(program, static_cast<tt_metal::KernelHandle>(kernel));

    for (const auto &core_range : core_spec.ranges()) {
        for (auto x = core_range.start_coord.x; x <= core_range.end_coord.x; ++x) {
            for (auto y = core_range.start_coord.y; y <= core_range.end_coord.y; ++y) {
                kernel_ptr->set_runtime_args(CoreCoord(x, y), runtime_args);
            }
        }
    }
}

Getting it via another friend here:

std::shared_ptr<Kernel> GetKernel(const Program &program, KernelHandle kernel_id) {
    return program.get_kernel(kernel_id);
}

That in turn searches std::vector<std::unordered_map<KernelHandle, std::shared_ptr<Kernel>>> kernels_:

std::shared_ptr<Kernel> detail::Program_::get_kernel(KernelHandle kernel_id) const {
    // TT_ASSERT(kernel_id < this->kernels_.size(), "Expected Kernel with ID {} to be in Program {}", kernel_id,
    // this->id);
    //  find coretype based on kernel_id
    for (const auto &kernels : this->kernels_) {
        if (kernels.find(kernel_id) != kernels.end()) {
            return kernels.at(kernel_id);
        }
    }

    TT_ASSERT(false, "Did not find kernel id across all core types!");
    return nullptr;
}

Wow.

Can we make it straight?

Here is what this code could be. No extra code participates in the flow.

ComputeKernel kernel(kernel_src, core_range_set, config);
kernel.set_runtime_args(core, {values_buffer->address(), index_buffer->address()});
auto pk_id = program.add_kernel(std::move(kernel));

or

ComputeKernel& kernel = program.add_compute_kernel(kernel_src, core_range_set, config);
kernel.set_runtime_args(core, {values_buffer->address(), index_buffer->address()});
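To make the second variant concrete, here is a minimal, self-contained sketch using mock types. Everything in it (the simplified CoreCoord, ComputeKernel, and Program, and the single-argument add_compute_kernel) is a hypothetical stand-in, not the tt-metal implementation; it only illustrates a program that owns its kernels and hands back a reference instead of a handle.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Mock types for illustration only; none of this is the tt-metal implementation.
struct CoreCoord {
    uint32_t x = 0;
    uint32_t y = 0;
    bool operator<(const CoreCoord& other) const {
        return x != other.x ? x < other.x : y < other.y;
    }
};

class ComputeKernel {
public:
    explicit ComputeKernel(std::string source_path) : source_path_(std::move(source_path)) {}

    void set_runtime_args(const CoreCoord& core, std::vector<uint32_t> args) {
        runtime_args_[core] = std::move(args);
    }
    const std::vector<uint32_t>& runtime_args(const CoreCoord& core) const {
        return runtime_args_.at(core);
    }
    const std::string& source_path() const { return source_path_; }

private:
    std::string source_path_;
    std::map<CoreCoord, std::vector<uint32_t>> runtime_args_;
};

class Program {
public:
    // Constructs the kernel in place and returns a reference to it, so the
    // caller needs no handle lookup to reach the object it just created.
    ComputeKernel& add_compute_kernel(std::string source_path) {
        assert(!source_path.empty());
        kernels_.push_back(std::make_unique<ComputeKernel>(std::move(source_path)));
        return *kernels_.back();
    }
    std::size_t num_kernels() const { return kernels_.size(); }

private:
    // unique_ptr keeps every kernel at a stable address, so references
    // returned earlier stay valid when the vector grows.
    std::vector<std::unique_ptr<ComputeKernel>> kernels_;
};
```

Storing the kernels behind unique_ptr is the design choice that makes returning a reference safe; a plain std::vector<ComputeKernel> would invalidate earlier references on reallocation.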

At first glance it looks easier to maintain, easier to debug, and faster.
But it does not make sense to set runtime arguments to the kernel!
We need to look deeper ✨

Why does our code look like this?

Alright, let's get real. Why does that code exist? Look at some of our classes:

Device has ~130 public methods, about 60-80 are used externally, the rest are only used (or intended for use only) by other metal runtime entities like DevicePool.
Kernel has ~50; only a couple are meant for external use, the rest are there for the system.
Program has ~50, same as above.

We don't like this massive API surface: it is hard to maintain and hard to change. It exposes things that we think of as internal to the Metal layer. We want that fixed. So we use free functions, friends, and opaque handles to draw lines between the APIs that serve internal (within Metal) and external (above Metal, e.g. TT-NN) consumers. And we expect that, in the limit, this allows us to completely hide the mess and give users a tidy API like SetRuntimeArgs(program, binary_writer_kernel_id, core, args).

Like anything, this has a cost. It is not all black and white, just some things to keep in mind:

We grow the amount of code to maintain
We trade performance
We make debugging more obscure and complicated
We make the codebase more error-prone
We make the API less discoverable
We don't address the quality of our internal abstractions fast enough
Ultimately, we slow down development

So far we have been ready to pay this cost.
Today I look at our code and ask - is it worth it?

The Issue

The problem is that we have inadequate abstractions. To protect users, we reframe it as a problem of "internal vs external" access. However, this approach has proven to be disproportionately costly. Not only are we failing to adequately shield users, but we're also not making meaningful progress in addressing the core issue.

Take a look at a typical pipeline:

Choose the source code
Specify which sources should be composed into the program
Build the program with a set of compile options, including defines
Cache compiled binaries
Run the built program with a set of runtime arguments

Look at the Kernel class:

What does Kernel represent today?
Why does it know about its binaries?
Why does it know about runtime arguments?

All these methods are a part of a Kernel today.

std::vector<ll_api::memory const*> const& binaries(uint32_t build_key) const;
virtual bool configure(IDevice* device, const CoreCoord &logical_core, uint32_t base_address, const uint32_t offsets[]) const = 0;
virtual void generate_binaries(IDevice* device, JitBuildOptions &build_options) const = 0;
uint32_t get_binary_packed_size(IDevice* device, int index) const;
uint32_t get_binary_text_size(IDevice* device, int index) const;
void set_binary_path(const std::string &binary_path) { binary_path_ = binary_path; }
void set_binaries(uint32_t build_key, std::vector<ll_api::memory const*>&& binaries);
virtual void read_binaries(IDevice* device) = 0;
void set_runtime_args(const CoreCoord &logical_core, stl::Span<const uint32_t> runtime_args);
void set_common_runtime_args(stl::Span<const uint32_t> runtime_args);

Same with Device and Program. We mix multiple responsibilities into a single class, and now language/build tools do not suffice. We have two options with different dynamics; both have a high cost.

Create a separate public interface in a form of free functions and hide the object behind an opaque handle (1)
- This led to unintended consequences, such as increased complexity and reduced performance, consumer dissatisfaction
Revisit the design (2)
- We know this can't be done overnight
- But say 3 months? You bet significant improvements can be made

I propose that we explore option 2 more deeply. If we treat a class as an atomic element in the design, accepting that its API can't be split into internal/external, we can avoid unnecessary complexity. This principle should guide ✨ the system's design. There is an excellent example in the comments.

What's next?

Acknowledge the nature of the problem, our current practices, their cost and consequences.
Explore the proposed guiding principle, focusing on specific cases
- Prefer extracting separate responsibilities from existing classes over splitting the API of those classes into internal/external with friends and external methods.
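As a rough illustration of what extracting a responsibility could look like, the sketch below pulls building and binary caching out of the kernel into their own types. KernelSpec, KernelBinary, and BuildCache are hypothetical names invented for this example, and the "build" step is a mock that only produces a string; this is a sketch under those assumptions, not a proposal for the actual class layout.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical types for illustration; not the tt-metal classes.

// What to build: a source file plus compile-time defines. A plain value type.
struct KernelSpec {
    std::string source_path;
    std::vector<std::string> defines;

    bool operator<(const KernelSpec& other) const {
        return std::tie(source_path, defines) < std::tie(other.source_path, other.defines);
    }
};

// The result of a build. Never modified after creation.
struct KernelBinary {
    std::string description;  // stands in for the real compiled artifact
};

// Owns compilation and caching, so neither concern lives on the kernel itself.
class BuildCache {
public:
    std::shared_ptr<const KernelBinary> get_or_build(const KernelSpec& spec) {
        assert(!spec.source_path.empty());
        auto it = cache_.find(spec);
        if (it != cache_.end()) {
            return it->second;  // cache hit: no rebuild
        }
        ++num_builds_;
        std::string description = "built:" + spec.source_path;
        for (const std::string& define : spec.defines) {
            description += " -D" + define;
        }
        std::shared_ptr<const KernelBinary> binary =
            std::make_shared<KernelBinary>(KernelBinary{std::move(description)});
        cache_.emplace(spec, binary);
        return binary;
    }

    std::size_t num_builds() const { return num_builds_; }

private:
    std::map<KernelSpec, std::shared_ptr<const KernelBinary>> cache_;
    std::size_t num_builds_ = 0;
};
```

With this split, nothing about binaries needs to appear on the type that users place on cores, so there is no internal/external line to draw on that type in the first place.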

marty1885 · 2025-01-09T04:25:00Z

marty1885
Jan 9, 2025

The ping-ponging between the object itself and an external entity is... wow.

Personally I prefer API style 2.

ComputeKernel& kernel = program.add_compute_kernel(kernel_src, core_range_set, config);
kernel.set_runtime_args(core, {values_buffer->address(), index_buffer->address()});

Because:

Kernels in programs are easier to track
Enables failing early if overwriting existing kernels in program, instead of checking in add_kernel which introduces delay
The most common thing needed is easy and fast

However, I see some potential problems (with one or both APIs):

How do I set per core kernel arguments?
With design 2, kernels are bound to a program thus (in the programming model) cannot be created and reused across programs
- is this a concern?
I don't like the kernel holding runtime args. It requires manual management of the program state so the hardware fits the software interface
- Which is extra code and not correct by default
- Unlike OpenCL, Metalium's relation between program and kernels is inverted

I don't know how much impact the above items have, but I think the following design addresses the issues. However, I don't know the codebase well enough to judge correctly.

Program program;

// Represents a compiled kernel. Kernels depend only on the underlying device arch, which is known at runtime.
Kernel kernel("/path/to/your/kernel.cpp", config);
// If we wish to support multiple archs at the same time (i.e. ARCH_NAME is removed as an env variable):
// Kernel kernel = device->create_kernel("/path/to/your/kernel.cpp", config)

// Store runtime info in program not the kernel. So kernel can be reused if needed
program.use_kernel(kernel, cores, {values_buffer->address(), index_buffer->address()});

// If per-core args are needed, either reuse the kernel with different args per core
for(....) {
    program.use_kernel(kernel, core_x_y, {i, ...}); // the easy way
}
// .. or explicitly modify the values
// program.set_args(core_range, {...}); // This is easy to screw up. But fine because it's uncommon
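A self-contained mock of this idea might look like the following. The Kernel and Program classes here are simplified stand-ins, and the use_kernel signature (one core per call, kernel passed as a shared_ptr) is an assumption made for the sketch; the point is only that runtime arguments live in the program, keyed by kernel and core, so one kernel object can be reused across cores and programs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-ins for illustration; not the tt-metal types.
struct CoreCoord {
    uint32_t x = 0;
    uint32_t y = 0;
    bool operator<(const CoreCoord& other) const {
        return x != other.x ? x < other.x : y < other.y;
    }
};

// Describes a kernel only. It holds no cores and no runtime arguments.
class Kernel {
public:
    explicit Kernel(std::string source_path) : source_path_(std::move(source_path)) {}
    const std::string& source_path() const { return source_path_; }

private:
    std::string source_path_;
};

class Program {
public:
    // Places `kernel` on `core` with the given runtime arguments.
    // Calling it again for the same kernel and core overwrites the arguments.
    void use_kernel(std::shared_ptr<const Kernel> kernel, const CoreCoord& core,
                    std::vector<uint32_t> args) {
        assert(kernel != nullptr);
        placements_[std::move(kernel)][core] = std::move(args);
    }

    const std::vector<uint32_t>& runtime_args(const std::shared_ptr<const Kernel>& kernel,
                                              const CoreCoord& core) const {
        return placements_.at(kernel).at(core);
    }

    std::size_t num_cores(const std::shared_ptr<const Kernel>& kernel) const {
        auto it = placements_.find(kernel);
        return it == placements_.end() ? 0 : it->second.size();
    }

private:
    // Runtime state is stored in the program, keyed by kernel and then by core.
    std::map<std::shared_ptr<const Kernel>, std::map<CoreCoord, std::vector<uint32_t>>> placements_;
};
```

Because the kernel is shared as const, two programs can place the same kernel with different arguments without affecting each other.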

2 replies

ayerofieiev-tt Jan 9, 2025
Maintainer Author

@marty1885 , agreed. I like this train of thought.

We are literally building a small compiler pipeline. Not inventing anything new:

Choose the source code
Specify which sources should be composed into the program
Build that program with a set of compile options, including defines
Cache compiled binaries
Run the built program with a set of runtime arguments

What does Kernel represent today?
Why does it know about its binaries?
Why does it know about runtime arguments?

All these methods are a part of a Kernel today. It carries too many responsibilities.

std::vector<ll_api::memory const*> const& binaries(uint32_t build_key) const;
virtual bool configure(IDevice* device, const CoreCoord &logical_core, uint32_t base_address, const uint32_t offsets[]) const = 0;
virtual void generate_binaries(IDevice* device, JitBuildOptions &build_options) const = 0;
uint32_t get_binary_packed_size(IDevice* device, int index) const;
uint32_t get_binary_text_size(IDevice* device, int index) const;
void set_binary_path(const std::string &binary_path) { binary_path_ = binary_path; }
void set_binaries(uint32_t build_key, std::vector<ll_api::memory const*>&& binaries);
virtual void read_binaries(IDevice* device) = 0;
void set_runtime_args(const CoreCoord &logical_core, stl::Span<const uint32_t> runtime_args);
void set_common_runtime_args(stl::Span<const uint32_t> runtime_args);

And it is not just about a Kernel. The same approach is applicable to Device, Program and other classes.

abhullar-tt Jan 9, 2025
Collaborator

@pgkeller and @spoojaryTT, I wanted to bring your attention to this since I know @spoojaryTT is porting some of the build state out of tt_metal::Device. I'm not sure if there are any existing discussions regarding binary compilation + the Kernel object.

bbradelTT · 2025-01-09T16:52:57Z

bbradelTT
Jan 9, 2025
Collaborator

Hopefully making these changes will help speed up development of the software for future architectures. It would be good to consider how the code should evolve, especially with the next generation of hardware.

0 replies

nsmithtt · 2025-01-09T18:23:55Z

nsmithtt
Jan 9, 2025
Collaborator

One thing to be mindful of is that the current design of SetRuntimeArgs makes the separation of program/kernel creation and the setting of runtime arguments a bit more explicit, where:

Program/kernel creation happens once and should become an opaque and stateless entity (though this isn't true today).
SetRuntimeArgs should record a stateful operation into the argument buffer for the subsequent EnqueueProgram.

I agree that the kernel and program objects carry too many responsibilities and I think one way to combat this is to tease out as much immutable state up front as possible. Programs/kernels are first built and should become immutable objects. Separately, setting runtime arguments should be a stateful operation that is recorded into the command queue, either via some (cleaned up) flavor of SetRuntimeArgs or as part of the EnqueueProgram api via some RuntimeArgs descriptor.
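A small mock can illustrate the split described here. All names below (the simplified Program, Command, and CommandQueue, and the set_runtime_args and enqueue_program members) are hypothetical and chosen only for this sketch: the program is built once and shared as const, while argument updates are stateful operations on the queue that get snapshotted when the program is enqueued.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical mock types; not the tt-metal API.
using RuntimeArgs = std::vector<uint32_t>;

// Built once and then only shared as const: no setters, no runtime state.
class Program {
public:
    explicit Program(std::vector<std::string> kernel_sources)
        : kernel_sources_(std::move(kernel_sources)) {}
    std::size_t num_kernels() const { return kernel_sources_.size(); }

private:
    std::vector<std::string> kernel_sources_;
};

struct Command {
    std::shared_ptr<const Program> program;
    std::map<std::size_t, RuntimeArgs> args;  // snapshot taken at enqueue time
};

class CommandQueue {
public:
    // Stateful: updates the arguments that the next enqueue will capture.
    void set_runtime_args(std::size_t kernel_index, RuntimeArgs args) {
        pending_args_[kernel_index] = std::move(args);
    }

    // Records the program together with a copy of the current arguments.
    void enqueue_program(std::shared_ptr<const Program> program) {
        assert(program != nullptr);
        commands_.push_back(Command{std::move(program), pending_args_});
    }

    const std::vector<Command>& commands() const { return commands_; }

private:
    std::map<std::size_t, RuntimeArgs> pending_args_;
    std::vector<Command> commands_;
};
```

Each recorded command keeps its own copy of the arguments, so changing them for a later enqueue does not alter what was already recorded, and the same immutable program object can back every command.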

Implementations from other APIs

OpenCL

https://www.khronos.org/files/opencl30-reference-guide.pdf

Build immutable program object:

cl_program clCreateProgramWithSource (
  cl_context context, cl_uint count, const char **strings,
  const size_t *lengths, cl_int *errcode_ret)

cl_int clBuildProgram (cl_program program,
  cl_uint num_devices, const cl_device_id *device_list,
  const char *options, void (CL_CALLBACK*pfn_notify)
  (cl_program program, void *user_data),
  void *user_data)

Create stateful kernel + arguments object that can be enqueued:

cl_kernel clCreateKernel (cl_program program,
  const char *kernel_name, cl_int *errcode_ret)

cl_int clSetKernelArg (cl_kernel kernel,
  cl_uint arg_index, size_t arg_size,
  const void *arg_value)

cl_int clEnqueueNDRangeKernel (cl_command_queue command_queue,
  cl_kernel kernel,
  ...)

Apple Metal API

Build immutable program object:

id<MTLFunction> addFunction = [defaultLibrary newFunctionWithName:@"add_arrays"];
_mAddFunctionPSO = [_mDevice newComputePipelineStateWithFunction:addFunction error:&error];

Do stateful argument update:

id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];
[computeEncoder setComputePipelineState:_mAddFunctionPSO];
[computeEncoder setBuffer:argBufferA offset:0 atIndex:0];
[computeEncoder setBuffer:argBufferB offset:0 atIndex:1];
[computeEncoder setBuffer:argBufferResult offset:0 atIndex:2];

9 replies

marty1885 Jan 10, 2025

Quick Q: Why are core ranges bound to kernels? AFAIK kernels are just a set of RISC-V binaries compiled with specific compilation flags. They should have no coupling with ranges. I think I'm wrong.

I agree immutable programs are a good idea.

nsmithtt Jan 10, 2025
Collaborator

@marty1885 this is a good point, I'm not 100% sure the API necessarily needs to be formed this way, i.e. I don't think any core range info is required for kernel compilation (@abhullar-tt is this true?).

That said, it is legal to compile multiple compute kernels that belong to non-overlapping grid regions and that participate in the same Program. The most extreme example of this, given an 8x8 device grid, you could compile 64 unique compute kernels that each span a 1x1 core range. Every core would be running its own kernel as part of this single Program.

From an API perspective, it's certainly simpler and easier to reason about if the grid ranges are specified statically during program creation like this, but I'm actually not sure if there is a real constraint here; in theory I think the grid ranges for each program could be dynamically formed. Perhaps we should consider leaving the door open for this.

pgkeller Jan 10, 2025
Collaborator

Read the above, random comments:

1. Set/update runtime args and execute trace are the two top perf paths through the code base
2. Looking up a kernel through a hash to update runtime args is...bad
3. I don't like specifying RTAs at EnqueueProgram. In the end I suspect we'll have multiple RTA update APIs since RTA updates are expensive and sometimes no RTAs need to change, sometimes they all change and sometimes just a few change. We may also want a "Buffer Object" type implementation where the client can write the RTAs directly to DMAable memory (rather than baking them into the command stream - which will require managing the async behavior, so TBD)
4. There is also the common runtime args API. In theory, this is great as we can reduce the work of iterating over cores to send RTAs. In practice I don't think it is used and if it is not used carefully it will be a decelerator (think: mcast 1 common RTA and then unicast 1 unique RTA, slowdown. vs change unicasting 1 RTA to 1 mcast, win). Common RTAs need to know the cores affected and are tied to the kernel (just something to keep in mind).
5. Kernel builds do not depend on cores, these can be decoupled. Decoupling this will make it cleaner to support multiple "supply the kernel" APIs, eg, just in time build like we have now and offline pre-built kernels
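On the hash-lookup concern above, one possible direction (an assumption for illustration, not something proposed in the thread) is a handle that already encodes where the kernel lives, so lookup becomes two array indexings instead of a scan over hash maps. CoreType, KernelHandle, Kernel, and Program below are simplified mock types.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Mock types for illustration; not the tt-metal definitions.
enum class CoreType : uint8_t { Tensix = 0, Ethernet = 1 };
constexpr std::size_t kNumCoreTypes = 2;

// The handle carries the core type and a dense index within that type.
struct KernelHandle {
    CoreType core_type;
    uint32_t index;
};

struct Kernel {
    std::string name;
    std::vector<uint32_t> runtime_args;
};

class Program {
public:
    KernelHandle add_kernel(CoreType core_type, std::string name) {
        std::vector<Kernel>& kernels = kernels_[static_cast<std::size_t>(core_type)];
        kernels.push_back(Kernel{std::move(name), {}});
        return KernelHandle{core_type, static_cast<uint32_t>(kernels.size() - 1)};
    }

    // Direct indexing: no hashing and no search across core types.
    // Handles stay valid as kernels are added; references obtained here
    // should not be held across add_kernel calls.
    Kernel& get_kernel(KernelHandle handle) {
        std::vector<Kernel>& kernels = kernels_[static_cast<std::size_t>(handle.core_type)];
        assert(handle.index < kernels.size());
        return kernels[handle.index];
    }

private:
    std::array<std::vector<Kernel>, kNumCoreTypes> kernels_;
};
```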

marty1885 Jan 10, 2025

I think decoupling core range and kernel has significant advantages. See my comment above. It's one less unnecessary responsibility, it decouples kernels from the program (as it should be, since a program is a set of placed kernels), and it makes specifying SPMD arguments sane. It also reduces the cost of running the same op with different grid sizes (maybe conv and SDPA can utilize this?), since the kernel only has to be constructed once instead of multiple times from cache.

nsmithtt Jan 10, 2025
Collaborator

@pgkeller, responding to some of your points:

1. Not surprising :)
2. Agreed, we should use a bind point index or handle instead of a string like my example. It was just easier to illustrate the point w/ a name.
3. I'm largely indifferent as to whether we keep the SetRuntimeArgs APIs or provide the arguments as part of EnqueueProgram. Regarding the perf concern, would it be feasible to always do indirect-style argument passing (using graphics terminology)? E.g., modifying my above code:

RuntimeArgumentsSpec runtimeArgsSpecA = {
    {0 /* argument index 0 */, RuntimeArgumentMode::PerCore},
    {1 /* argument index 1 */, RuntimeArgumentMode::MultiCast},
};
RuntimeArgumentsSpec runtimeArgsSpecB = {
    {0 /* argument index 0 */, RuntimeArgumentMode::MultiCast},
    {1 /* argument index 1 */, RuntimeArgumentMode::MultiCast},
};
// Call this once during setup
RuntimeArgumentsBuffer runtimeArgsBufA =
    device.create_kernel_runtime_arguments_buffer(program, "a", runtimeArgsSpecA, core_range);
RuntimeArgumentsBuffer runtimeArgsBufB =
    device.create_kernel_runtime_arguments_buffer(program, "b", runtimeArgsSpecB, core_range);

// Some runtime args can be set once outside of the loop and/or lazily updated
uint32_t i = 0;
for (const CoreCoord& core_coord : core_range) {
    runtimeArgsBufA.set(0, i++, core_coord); // set kernel "a", argument index 0, to value i, for core "core_coord"
}
runtimeArgsBufB.set(0, 456); // set kernel "b", argument index 0, to value 456, no core_coord specified for multicast args

for (uint32_t loop_idx = 0; loop_idx < num_loops; ++loop_idx) {
    // Some runtime args need to be updated every iteration
    runtimeArgsBufB.set(1, loop_idx); // set kernel "b", argument index 1, to value loop_idx, no core_coord specified for multicast args
    EnqueueProgram(cq, program, {{"a", runtimeArgsBufA}, {"b", runtimeArgsBufB}});  // List of pairs mapping kernel handle to corresponding buffer
}

RuntimeArgumentsBuffer could theoretically just be a metal buffer, but then we'd have to define layout, sizing / alignment requirements, which kernels reference which sections, etc. Encapsulating in an API makes it much easier to use and would enable us to change the layout under the hood in the future.

The kernel string names "a" and "b" are just illustrative, we could keep these as KernelHandle or a kernel index for better efficiency.
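To show how such an encapsulated buffer could behave, here is a minimal host-side mock. The storage layout, the constructor, and the get accessor are assumptions made for this sketch (the thread only names RuntimeArgumentsBuffer, RuntimeArgumentMode, and set); a real version would have to deal with device memory layout and alignment, which this deliberately ignores.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Mock types for illustration; not the tt-metal definitions.
struct CoreCoord {
    uint32_t x = 0;
    uint32_t y = 0;
    bool operator<(const CoreCoord& other) const {
        return x != other.x ? x < other.x : y < other.y;
    }
};

enum class RuntimeArgumentMode { PerCore, MultiCast };

class RuntimeArgumentsBuffer {
public:
    // `spec[i]` gives the mode of argument index i.
    RuntimeArgumentsBuffer(std::vector<RuntimeArgumentMode> spec, const std::vector<CoreCoord>& cores)
        : spec_(std::move(spec)), multicast_(spec_.size(), 0) {
        for (const CoreCoord& core : cores) {
            per_core_[core] = std::vector<uint32_t>(spec_.size(), 0);
        }
    }

    // Multicast arguments hold one value shared by all cores.
    void set(std::size_t index, uint32_t value) {
        assert(spec_.at(index) == RuntimeArgumentMode::MultiCast);
        multicast_[index] = value;
    }

    // Per-core arguments hold one value for each core.
    void set(std::size_t index, uint32_t value, const CoreCoord& core) {
        assert(spec_.at(index) == RuntimeArgumentMode::PerCore);
        per_core_.at(core)[index] = value;
    }

    // The value a given core would observe for argument `index`.
    uint32_t get(std::size_t index, const CoreCoord& core) const {
        if (spec_.at(index) == RuntimeArgumentMode::MultiCast) {
            return multicast_[index];
        }
        return per_core_.at(core)[index];
    }

private:
    std::vector<RuntimeArgumentMode> spec_;
    std::vector<uint32_t> multicast_;
    std::map<CoreCoord, std::vector<uint32_t>> per_core_;
};
```

Because callers only go through set and get, the internal layout could later change (for example to a single contiguous block) without touching user code, which is the encapsulation benefit described above.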
