TODO

test storage on AWS / using ceph on any cloud

Maybe

try webhooks again
Currently we have no representation of quota - we need to be able to set (and check) hard limits from the scheduler (or maybe we get that out of the box?)
klog can be changed to add V(2) to handle verbosity from the command line, see https://pkg.go.dev/k8s.io/klog/v2

Completed

add port forward to python sdk, and example to programmatically do it, then we need to use that for automated tests
volume should have size request, not hard coded as it is now
make deployments own section of docs
interface still needs debugging for 2+ processes
diagnostics should be on the level of the container
the spec needs to support a local volume
Eventually: nice, pretty, branded user docs that describe creating the CRD, and cases of sleep infinity vs command
test better method from Aldo for networking
allow specifying that the app RESTful server is already installed, so it is not installed again
we need to test that N=1 case works as expected (not waiting for any workers) and 0 spits an error (for now it doesn't make sense)
docs need spell checking!
Convert markdown docs into pretty, organized, rendered web-docs
Remove automated builds from here in favor of https://github.com/rse-ops/flux-hpc
Maximum time for job (seconds) set by CRD
Do we want to be able to launch additional tasks? (e.g., after the original job started) (for now, no, but this can be re-addressed if a case comes up)
Should there be a min/max size for the MiniCluster CRD (probably 2 if we want to have main/worker)? (right now it just cannot be zero)
We will want an event/watcher to shut down all jobs when the main command is done. Otherwise the others sometimes keep running (currently we require all ranks to be ready and then they clean up)
what should be the proper start command for the main/worker nodes (this is important because it will determine when a job is complete) (rank 0 runs user command, workers just start)
figure out where to put flux hostname / config - volume needs write
Are --cores properly set? (yes, not setting uses the default set by hwloc and that's reasonable)
debug nodes finding one another (see How it works in README.md)
By what user? I am currently root but flux is an option (decided to use root to setup, but then run as flux user)
How can we print better verbose debugging output (possibly exposed by a variable)? (done, debugging boolean is added)
And have some solid evidence the node communication is successful (or is the job running that evidence?) (this now appears to be printing)
debug pod containers not seeing config again (e.g., mounts not creating)
Should the secondary (non-driver) pods have a different start command? (answer is no - with the Indexed job it's all the same command)
Details for etc-hosts (or will this just work? - no it won't just work)
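
The size rules noted in the Completed list (N=1 runs without waiting for any workers, 0 is an error) could be checked with something like the sketch below. The function name WorkersToWaitFor and the error value are hypothetical and not taken from the operator code; rejecting negative sizes is also an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrInvalidSize is a hypothetical error for a MiniCluster size below 1.
var ErrInvalidSize = errors.New("MiniCluster size must be at least 1")

// WorkersToWaitFor returns how many worker ranks rank 0 would wait for.
// A size of 1 means no workers; a size of 0 (or less, by assumption) is an error.
func WorkersToWaitFor(size int32) (int32, error) {
	if size < 1 {
		return 0, ErrInvalidSize
	}
	return size - 1, nil
}

func main() {
	for _, size := range []int32{0, 1, 4} {
		workers, err := WorkersToWaitFor(size)
		if err != nil {
			fmt.Printf("size=%d: error: %v\n", size, err)
			continue
		}
		fmt.Printf("size=%d: wait for %d workers\n", size, workers)
	}
}
```

If a minimum of 2 is ever adopted (main plus at least one worker), only the comparison in the guard would need to change.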

Design 2 (not currently working on)

pkg/util/heap should implement an actual heap
kubebuilder should be able to provide defaults in the *_types.
Figure out logging connected to reconciler
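
For the pkg/util/heap item, one option is to build on Go's standard container/heap package rather than writing the sift logic by hand. The sketch below is only an illustration: the Job type, its Priority field, and the DrainNames helper are hypothetical, and the real element type and ordering may differ.

```go
package main

import (
	"container/heap"
	"fmt"
)

// Job is a hypothetical queue entry; the real element type may differ.
type Job struct {
	Name     string
	Priority int
}

// JobHeap is a max-heap on Priority that satisfies heap.Interface.
type JobHeap []Job

func (h JobHeap) Len() int           { return len(h) }
func (h JobHeap) Less(i, j int) bool { return h[i].Priority > h[j].Priority }
func (h JobHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }

// Push appends an element; container/heap restores the heap order afterwards.
func (h *JobHeap) Push(x interface{}) { *h = append(*h, x.(Job)) }

// Pop removes the last element; container/heap has already moved the
// highest-priority item there before calling this.
func (h *JobHeap) Pop() interface{} {
	old := *h
	n := len(old)
	item := old[n-1]
	*h = old[:n-1]
	return item
}

// DrainNames heapifies a copy of jobs and returns names from highest to lowest priority.
func DrainNames(jobs []Job) []string {
	h := make(JobHeap, len(jobs))
	copy(h, jobs)
	heap.Init(&h)
	names := make([]string, 0, len(jobs))
	for h.Len() > 0 {
		names = append(names, heap.Pop(&h).(Job).Name)
	}
	return names
}

func main() {
	jobs := []Job{{"low", 1}, {"high", 10}, {"mid", 5}}
	fmt.Println(DrainNames(jobs))
}
```

Flipping the comparison in Less gives a min-heap instead, if lower values should be served first.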

Design 1 (not currently working on)

consolidate configmap functions into shared functionality (less redundancy)
Debug why the configmaps aren't being populated with the hostfile (it wasn't working with kind, worked without changes with minikube)
Figure out adding namespaces to config/samples - should be flux-operator
Each of config files written (e.g., hostname, broker, cert) should have their own types and more simply generated. The strategy right now is just temporary.
Stateful set (figure out how to create properly, doesn't seem to have pods) (figured out need to create ConfigMaps for Volumes)
