Revamping Omniperf Architecture #153
-
Justification
We've been holding off on this because we wanted to do it right. This seems like an appropriate opportunity to tackle the implementation.
-
Additional Notes
Also, we should drop the separate metrics for 'AVG' vs. 'MIN' vs. 'MAX', etc., and just rely on pandas functions for these to cut down on unnecessary function calls / evals.
cc: @feizheng10
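As a rough sketch of that direction (the DataFrame layout and column names below are placeholders, not Omniperf's actual schema), a single pandas groupby/agg call can replace separately defined AVG/MIN/MAX metric entries:

```python
# Sketch only: column names and values are placeholders, not Omniperf's real schema.
import pandas as pd

df = pd.DataFrame({
    "Kernel_Name": ["kernA", "kernA", "kernB", "kernB"],
    "GRBM_GUI_ACTIVE": [1200, 1350, 900, 950],
})

# One groupby/agg pass yields mean, min, and max together, instead of three
# separately defined metrics each triggering their own evaluation.
stats = df.groupby("Kernel_Name")["GRBM_GUI_ACTIVE"].agg(["mean", "min", "max"])
print(stats)
```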
-
Describe the suggestion
Consistent flag naming between modes.
Justification
I may use Omniperf more than... almost anyone?
Implementation
I suggest we audit all existing flags to see which can be unified between modes. Similar problems exist for, e.g., kernel selection (on profile it's a regex, on analyze it's an integer based on the top-N chart) and blocks (hardware block name on profile, table number on output). For profiles where there's only one arch (the typical case), we don't ask the user to specify it at all (otherwise kick out an error saying "you have to give the full path", or something). A sketch of one possible approach follows.
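One possible way to enforce that unification is sketched below with argparse: shared options live on a parent parser attached to both subcommands, so profile and analyze cannot drift apart. The flag names here are illustrative only, not a proposed final interface.

```python
# Sketch only: flag names are illustrative, not a proposed final interface.
import argparse

# Options shared by both modes are declared exactly once.
shared = argparse.ArgumentParser(add_help=False)
shared.add_argument("--kernel", help="kernel filter, same syntax in both modes")
shared.add_argument("--block", help="hardware block filter, same names in both modes")

parser = argparse.ArgumentParser(prog="omniperf")
modes = parser.add_subparsers(dest="mode", required=True)
modes.add_parser("profile", parents=[shared])
modes.add_parser("analyze", parents=[shared])

args = parser.parse_args(["analyze", "--kernel", "my_kernel.*"])
print(args.mode, args.kernel)
```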
-
Describe the suggestion
Better normalization modes over multiple kernels.
Justification
In conversations with users, I have found that there is significant confusion over the values that are presented when multiple kernels are selected for analysis. In particular, folks ask questions like "why did my bandwidth go down when I executed <10x more kernels>?" A small illustration of this follows.
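As a hedged illustration of that confusion (all numbers below are invented), averaging per-invocation bandwidth and dividing summed bytes by summed duration give very different answers once slower, low-traffic kernels are folded into the selection:

```python
# Invented numbers: one bandwidth-bound copy kernel plus nine compute-bound kernels.
import pandas as pd

runs = pd.DataFrame({
    "kernel":      ["copy"] + ["compute"] * 9,
    "bytes_moved": [1.0e9]  + [1.0e6] * 9,   # bytes per invocation
    "duration_s":  [1.0e-3] + [1.0e-2] * 9,  # seconds per invocation
})

# Two plausible "normalization modes" that disagree wildly on the same data.
per_invocation_gbps = (runs["bytes_moved"] / runs["duration_s"]).mean() / 1e9
aggregate_gbps = runs["bytes_moved"].sum() / runs["duration_s"].sum() / 1e9
print(f"mean per-invocation bandwidth: {per_invocation_gbps:.1f} GB/s")
print(f"aggregate over all kernels:    {aggregate_gbps:.1f} GB/s")
```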
Implementation
Options include:
-
Describe the suggestion
Better metric dependency resolution and reuse.
Justification
Enable reuse of intermediate computations and metrics to avoid code duplication / copy-paste errors. Other examples would be things like continually recomputing duration, GRBM_GUI_ACTIVE * $numCU, etc. If we had a more flexible evaluator, we could also theoretically let a user ask for just a specific metric, or a group of metrics (and break the "ask for SQ on profile", "ask for section x.y on analyze" chain).
Implementation
It's less clear what the best way to do this would be, short of a full AST with dependency resolution.
Additional Notes
May not be as important if we can pre-compile metrics.
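Short of a full AST, even a small memoized resolver captures much of the reuse. A minimal sketch follows; the metric names, counter values, and expression style are illustrative, not Omniperf's actual configuration format:

```python
# Sketch only: names, values, and the expression style are illustrative.
from functools import lru_cache

COUNTERS = {"GRBM_GUI_ACTIVE": 123456.0, "SQ_WAVES": 2048.0}
CONSTANTS = {"numCU": 104.0}

# Derived metrics reference other metrics/counters by name through `ev`,
# so shared sub-expressions are declared once and reused everywhere.
METRICS = {
    "gui_active_cus": lambda ev: ev("GRBM_GUI_ACTIVE") * CONSTANTS["numCU"],
    "waves_per_cu_cycle": lambda ev: ev("SQ_WAVES") / ev("gui_active_cus"),
}

@lru_cache(maxsize=None)
def evaluate(name: str) -> float:
    """Resolve a metric by name, recursing through dependencies with memoization."""
    if name in COUNTERS:
        return COUNTERS[name]
    return METRICS[name](evaluate)

# Asking for a single metric pulls in only what it needs, and intermediate
# results like gui_active_cus are computed exactly once.
print(evaluate("waves_per_cu_cycle"))
```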
-
Justification
Additionally, there may be some room to rethink the procedure for the test cases we add/cover. At the moment, I don't think there's much of a methodology behind the random list of command combinations tested. A possible direction is sketched below.
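One way to make that coverage more deliberate is to generate command combinations from an explicit matrix with pytest parametrization rather than hand-picking them. This is a sketch only; the modes and flags below are placeholder values, not a vetted test matrix.

```python
# Sketch only: placeholder flag matrix, not a vetted set of test cases.
import itertools
import subprocess

import pytest

MODES = ["profile", "analyze"]
EXTRA_FLAGS = [[], ["--kernel", "k.*"], ["--block", "SQ"]]

@pytest.mark.parametrize("mode,extra", list(itertools.product(MODES, EXTRA_FLAGS)))
def test_cli_combination_smoke(mode, extra):
    # Smoke check: every advertised combination should at least parse under --help.
    result = subprocess.run(["omniperf", mode, *extra, "--help"],
                            capture_output=True, text=True)
    assert result.returncode == 0
```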
-
Implementation
See internal planning page for more info...
-
Justification
If we were able to rethink the Standalone GUI to include some of the best parts of the Grafana solution, we'd be offering a much clearer solution to customers. Additionally, axing Grafana would mean an unequivocal source of metric definitions: the YAML files. No more jumping between MongoQL and YAML.
Implementation
$ omniperf analyze --host dummyhost --user amd --gui
Here the user could interact with a dropdown menu for workload selection, similar to today's Grafana. If you're not interested in that, you could skip the flag at install time (to skip the Docker DB setup) and interact via the CLI or the Standalone GUI in today's "offline" manner.
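For a feel of the flow, here is a minimal, hedged sketch of dropdown-driven workload selection, assuming the revamped front end keeps something like the current Dash-based standalone analyzer; the workload names and layout are placeholders:

```python
# Sketch only: placeholder workload names; assumes a Dash-style front end.
from dash import Dash, dcc, html, Input, Output

app = Dash(__name__)
app.layout = html.Div([
    dcc.Dropdown(id="workload",
                 options=["workload_A", "workload_B"],
                 value="workload_A"),
    html.Div(id="panel"),
])

@app.callback(Output("panel", "children"), Input("workload", "value"))
def show_workload(name):
    # In the real tool this would load the selected workload's metrics/panels.
    return f"Showing analysis for {name}"

if __name__ == "__main__":
    app.run(debug=True)
```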
-
Background
We're using this discussion space as an area to plan an upcoming "revamp" of Omniperf's source code.
Maintainers have been overjoyed by the number of users who have gotten value from this project. Omniperf, which originated as a proof-of-concept research project, is now widely used outside of its birthplace in AMD Research.
We'd like to take this as an opportunity to improve upon the tool's architecture to make contributing and maintaining easier for our community.
This is an open conversation around what improvements could be made to the architecture of Omniperf's source code.
Contributing Guide
Describe the suggestion
A clear and concise description of what aspect of Omniperf's source code you'd like changed.
Justification
How does this suggestion add value to the project, its maintainers, and/or users?
Implementation
Describe any ideas, steps, or processes that could be used to implement this suggestion into our repository.
Additional Notes
Any notes you'd like to add that were not mentioned above.
Update (9/12)
Thank you to those who shared suggestions in this discussion post. With your input, we've compiled a list of tasks that have been broken into two milestones:
To maintain a regular release cycle and steady feature enhancements, our first sprint cycle will address Milestone 1. Once our feature branch is in a stable state, we'll begin prioritizing the new functionality included in Milestone 2.
For more detail on the specific changes being rolled out in each milestone along with up-to-date delivery estimates, please see our repo's Milestone page. Please feel free to star/watch our repository to monitor progress as we begin to roll these changes out 😄