Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Show when runs are being processed #322

Open
wants to merge 10 commits into
base: master
Choose a base branch
from
Open

Conversation

takluyver
Copy link
Member

Put an icon next to the run number while a processing job is working on a run. The tooltip if you hover over that cell shows more details. This will also mean that new runs appear in the table when the backend starts processing them, instead of when the first job finishes.

Screencast from 2024-08-29 12-35-34

The status column would be another option, but we recently discussed hiding that by default.

@takluyver takluyver added the enhancement New feature or request label Aug 29, 2024
@takluyver takluyver requested a review from JamesWrigley August 29, 2024 11:51
Copy link

codecov bot commented Aug 29, 2024

Codecov Report

Attention: Patch coverage is 80.89888% with 34 lines in your changes missing coverage. Please review.

Project coverage is 75.60%. Comparing base (229568e) to head (6e98461).

Files with missing lines Patch % Lines
damnit/gui/table.py 74.50% 13 Missing ⚠️
damnit/backend/extract_data.py 83.78% 12 Missing ⚠️
damnit/backend/extraction_control.py 89.36% 5 Missing ⚠️
damnit/gui/main_window.py 0.00% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #322      +/-   ##
==========================================
+ Coverage   75.47%   75.60%   +0.13%     
==========================================
  Files          32       32              
  Lines        5137     5280     +143     
==========================================
+ Hits         3877     3992     +115     
- Misses       1260     1288      +28     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@JamesWrigley
Copy link
Member

JamesWrigley commented Aug 29, 2024

I confess I don't really like that the implementation depends on the GUI being running throughout the processing time, it'd be quite confusing to see some runs being processed in one window and them not being processed in another (particularly important for long-running jobs like at SPB and MID). I'd prefer to do this properly and save the job state in the database.

Other things:

  • If I read the code correctly this will show a job still running even if the slurm job was preempted, which is quite possible on exfel. We should periodically (every ~2 minutes or so) check slurm to see if the job is still alive.
  • This also doesn't account for multiple jobs running simultaneously for the same run, but I think we can overlook that for now since we don't really support it anyway. (oops, no that's not true)

@JamesWrigley
Copy link
Member

We should also have at least some tests 😅

@takluyver
Copy link
Member Author

I know it doesn't work correctly for GUIs launched while a run is already processing, but I chose to do it this way because:

  • I'm trying to avoid writing to the DB when every extractor process starts, because the thundering herd can hit database timeouts waiting for the lock.
  • We probably want the Kafka messages anyway, for prompt updates, so I don't think this makes it any harder to do a better implementation later.
    • It might even be possible to use Kafka's own persistence, rewinding a few hours and replaying messages from before the GUI started. Not sure about this yet.
    • Another option would be for the extract_data machinery to send out 'still running' messages every couple of minutes while the child process runs the context file, so any newly launched GUIs gets the correct state after a while.
  • From what I've seen at FXE, it's normal to leave windows open for long periods, so the simple implementation is already useful.

this will show a job still running even if the slurm job was preempted, which is quite possible on exfel. We should periodically (every ~2 minutes or so) check slurm to see if the job is still alive.

Yup, makes sense.

@JamesWrigley
Copy link
Member

The main reason I'm against this implementation is that it can lie to the user. That's a hard blocker for me, we should avoid displaying wrong information at all costs. A secondary reason is that we need the same functionality for the web interface, so I'd prefer if we didn't implement something that will specifically only work for the Qt GUI.

  • From what I've seen at FXE, it's normal to leave windows open for long periods, so the simple implementation is already useful.

I would rather not have the feature at all than display potentially wrong information 🤷 Even apart from that, other users doing analysis sometimes open the GUI on fastx so it doesn't really matter if the session on the control hutch PC's is always running.

@takluyver
Copy link
Member Author

Would you be OK with this if any incorrect information was time-limited, so it got the right state within, say, 60 seconds after you launch the GUI? Or would that still be unacceptable?

we need the same functionality for the web interface, so I'd prefer if we didn't implement something that will specifically only work for the Qt GUI.

I'd still build that around the backend sending out messages similar to these, just that some server piece would have to pass them on to the frontend. I don't see this as something that only works for Qt (besides the parts in the GUI code, of course).

@JamesWrigley
Copy link
Member

Not a fan of eventual consistency in interfaces 😛 It's also kinda confusing that people opening a fresh session will think that certain runs just started reprocessing. But as long as it's not too long then then it's ok, though I'd lightly prefer a max of like 30s than 60s.

@takluyver
Copy link
Member Author

OK, already processing runs should show up within 10 seconds, i.e. they send a 'running' message every 10 seconds.

The GUI checks every 2 minutes, as you suggested, for Slurm jobs that have exited without sending a 'finished' message. I think it should also be feasible to catch most types of failure & cancellation and send the message before the job exits, but this is a good backstop in any case.

damnit/backend/extract_data.py Show resolved Hide resolved
damnit/gui/table.py Outdated Show resolved Hide resolved
@JamesWrigley
Copy link
Member

JamesWrigley commented Aug 30, 2024

Just for context, I would still like to refactor this in the future to save the job status' in the database so we can alert users if a job was preempted or failed for some reason (e.g. timeouts).

@takluyver takluyver dismissed JamesWrigley’s stale review October 18, 2024 11:28

Requested changes made

@takluyver
Copy link
Member Author

I think this probably needs a bit of a change now that we're using array jobs, to make sure that the different ways of identifying Slurm jobs line up.

@takluyver
Copy link
Member Author

I did the change for Slurm array jobs. I looked into using squeue --json to get the job statuses, but it doesn't seem to work with the --jobs option & selecting >1 job, giving you a mountain of output for all jobs instead. So I went back to the splitting lines of text approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants