Skip to content

Latest commit

 

History

History
53 lines (44 loc) · 3.83 KB

TODO.md

File metadata and controls

53 lines (44 loc) · 3.83 KB

TODO

  • test storage on AWS / using ceph on any cloud

Maybe

  • try webhooks again
  • Currently we have no representation of quota - we need to be able to set (and check) hard limits from the scheduler (or maybe we get that out of the box)?
  • klog can be changed to add V(2) to handle verbository from the command line, see https://pkg.go.dev/k8s.io/klog/v2

Completed

  • add port forward to python sdk, and example to programmatically do it, then we need to use that for automated tests
  • volume should have size request, not hard coded as it is now
  • make deployments own section of docs
  • interface still needs debugging for 2+ processes
  • diagnostics should be on the level of the container
  • the spec needs to support a local volume
  • Eventually; nice pretty, branded user docs that describe creating CRD, and cases of sleep infinity vs command
  • test better method from Aldo for networking
  • allow to specify that the app restful server is installed, and don't install again
  • we need to test that N=1 case works as expected (not waiting for any workers) and 0 spits an error (for now it doesn't make sense)
  • docs need spell checking!
  • Convert markdown docs into pretty, organized, rendered web-docs
  • Remove automated builds from here in favor of https://github.com/rse-ops/flux-hpc
  • Maximum time for job (seconds) set by CRD
  • Do we want to be able to launch additional tasks? (e.g., after the original job started) (for now, no, but this can be re-addressed if a case comes up)
  • Should there be a min/max size for the MiniCluster CRD (probably 2 if we want to have main/worker)? (right not just cannot be zero)
  • We will want an event/watcher to shut down all jobs when the main command is done. Otherwise the others sometimes keep running (currently we require all ranks to be ready and then they clean up)
  • what should be the proper start command for the main/worker nodes (this is important because it will determine when a job is complete) (rank 0 runs user command, workers just start)
  • figure out where to put flux hostname / config - volume needs write
  • Are --cores properly set (yes, not setting uses the default set by hwloc and that's resonable)
  • debug nodes finding on another (see How it works in README.md)
  • By what user? I am currently root but flux is an option (decided to use root to setup, but then run as flux user)
  • How we can print better verbose debugging output (possibly exposed by a variable) (done, debugging boolean is added)
  • And have some solid evidence the node communication is successful (or is the job running that evidence? (this now appears to be printing)
  • debug pod containers not seeing config again (e.g., mounts not creating)
  • Should the secondary (non-driver) pods have a different start command? (answer is no - with the Indexed job it's all the same command)
  • Details for etc-hosts (or will this just work? - no it won't just work)

Design 2 (not currently working on)

  • pkg/util/heap should implement an actual heap
  • kubebuilder should be able to provide defaults in the *_types.
  • Figure out logging connected to reconciler

Design 1 (not currently working on)

  • consolidate configmap functions into shared functionality (less redundancy)
  • Debug why the configmaps aren't being populated with the hostfile (it wasn't working with kind, worked without changes with minikube)
  • Figure out adding namespaces to config/samples - should be flux-operator
  • Each of config files written (e.g., hostname, broker, cert) should have their own types and more simply generated. The strategy right now is just temporary.
  • Stateful set (figure out how to create properly, doesn't seem to have pods) (figured out need to create ConfigMaps for Volumes)