Revive this? #8
Comments
@njoly @bodgerer I'm wondering how much you guys know about GridEngine internals, or if you know anyone else who might? Dave Love seems to have disappeared again, and I'm wondering who else might have a deep knowledge of GridEngine and be willing to maintain a GitHub-based version of the project? |
It should be maintained. I am interested, but how much I can contribute depends on how much of the work I would have to solve myself. Anyway, I have some improvement ideas too; maybe we can do it together and write some articles for conference proceedings and journal issues.
|
Hello, |
It's not really about the specific SCM; it's about who will pick up the torch when Dave stops maintaining it -- which may have already happened. |
I'd be interested in contributing, though my understanding of
GridEngine internals is kind of shallow at this point. Mainly I know a
few very tiny spots where I fixed some compile failures when trying to
rebuild from source last spring, before Dave Love made his Fedora 28
packages available.
Chris
|
Hi all, I've worked on a few of the subsystems within gridengine, fixing a few horrible problems, and am happy to share what I've learned and help out if I can. Regarding maintaining a GitHub-based fork, I'm not sure my day job would give me time (SGE admin since version 5.3; currently have 3 SoGE clusters / 750 nodes). I might be seeing Dave later this month, but he did write on the list that he'd be happy for someone to take over maintenance - as long as he gets to properly hand things over. He's done a great job with SoGE over the past years. Mark |
It would be cool if you see Dave and talk to him about it. My concern is
there may not be many people who know the internals well enough.
Personally I don't think I have time to be the main organizer.
|
I'm not sure the problem is that not many people know the internals well enough: it's that there is a general lack of people interested in reading and editing the source code at all. I'm not sure how to fix that. |
Some years ago I read the code (to find a memory leak problem in masterd), and really there are many things that I don't understand. If I don't understand the code around what I'm modifying, I'm not sure I would be more restorative than destructive. |
I guess it needs better documentation, at the very least. I'll see if I can find time to start some. |
What level of documentation were you talking about here-- code-level vs.
repo-level (e.g. things like installation)?
BTW I remember we heard from Dave at some point that there were some things
in this repo (we cloned it from his master IIRC) that weren't intended for
release. So at some point it might make sense to look at that. But I
don't really have time to drive this forward properly.
|
I was thinking code-level, e.g. a getting started guide for prospective developers |
Mark-- sorry I didn't respond earlier... I was thinking about this. I
think what you propose is very useful but we need to make sure we start
from the right starting point, e.g. the right version of the code, and
don't end up starting something that's not going to sustain itself. Did
you end up seeing Dave?
|
Hi Dan, Sorry, forgot to reply. Yes, I saw Dave - but unfortunately we didn't have time to discuss gridengine. Cheers, Mark |
OK, it looks like I'm going to get into maintenance in a serious way soon. I've got requests to get SGE up and running on MacOS X (yikes!), and to add GPU memory allocation support (looks like we can build on gpu_loadsensor.c for that). Currently we're running Dave Love's Fedora 28 build on our systems, so it would be best to build on that. Unfortunately, I can't seem to download the corresponding SRPMs - has anyone figured out which commit(s)/tag(s) in this tree or one of Dave's correspond to that build? Thanks, |
That sounds great!
I'm sorry, I don't know that. Honestly, I don't even know what an SRPM is.
Regarding GPU memory allocation support: my feeling is that may be a little
bit ambitious -- i.e. anything that requires integrating GridEngine code
with NVidia code. GridEngine is already extremely complicated, and its
build process is nontrivial. Worrying about different NVidia library versions
seems like potentially a step too far, but that's just my opinion. Perhaps
it might be easier to support allocating particular GPUs (e.g. by
exporting the CUDA_VISIBLE_DEVICES variable). I think this is what you can
do with Univa GridEngine. Might be nice to maintain compatibility with
that.
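Roughly the kind of thing I mean -- a very rough, untested sketch, where the lock directory and the one-GPU-per-job assumption are invented for illustration (not how Univa actually does it):

```sh
#!/bin/sh
# Hypothetical starter_method: claim one free GPU with an atomic mkdir lock
# and expose only that GPU to the job via CUDA_VISIBLE_DEVICES, then exec the
# real job command. A matching epilog would have to remove the lock again.
LOCKDIR=/var/run/sge-gpus                       # invented path
mkdir -p "$LOCKDIR"
for id in $(nvidia-smi --query-gpu=index --format=csv,noheader); do
    if mkdir "$LOCKDIR/$id" 2>/dev/null; then   # atomic: first claimant wins
        export CUDA_VISIBLE_DEVICES="$id"
        exec "$@"                               # SGE passes the job command as arguments
    fi
done
echo "no free GPU on $(hostname)" >&2
exit 1
```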
Dan
|
No prob -- in an RPM-based system like Fedora, the SRPM is shorthand for the
source package corresponding to a given binary package (RPM). In theory you
can rebuild the binaries directly from that source package. Being able to
do that would help ensure that the MacOS port was fully in sync with the
Fedora implementation.
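If we can ever locate the SRPM, the rebuild itself is the easy part -- roughly this on Fedora, where the package name "gridengine" is a guess on my part:

```sh
# Rough sketch of a Fedora SRPM rebuild (needs dnf-plugins-core for `download`).
dnf download --source gridengine          # or fetch the .src.rpm by hand
sudo dnf builddep gridengine-*.src.rpm    # install the build dependencies
rpmbuild --rebuild gridengine-*.src.rpm   # binary RPMs land under ~/rpmbuild/RPMS/
```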
We're already using prolog/epilog scripts to manage GPUs on a per-GPU basis
as you describe. But many use cases don't require all the resources on a
GPU, and users are grumbling that they'd like to be able to share GPUs if
possible. This could be done by extending the prolog/epilog, but I'd
like to reduce the amount of manual intervention, and thought perhaps the
load sensor would help with that.
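What I was picturing is a load sensor that at least reports free GPU memory per host -- a sketch along these lines, where the gpu_free_mem complex name is made up and would have to be defined first, and nothing is enforced:

```sh
#!/bin/sh
# Hypothetical load sensor: speaks SGE's begin/end protocol on stdin/stdout
# and reports total free GPU memory (MB) on this host as gpu_free_mem.
HOST=$(hostname)
while read -r line; do
    [ "$line" = "quit" ] && exit 0
    FREE=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits \
           | awk '{s += $1} END {print s + 0}')
    echo "begin"
    echo "$HOST:gpu_free_mem:$FREE"
    echo "end"
done
```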
Chris
|
OK, thanks.
I do a lot of GPU stuff myself (as part of the Kaldi project). People
often do ask about whether they can do that (share GPUs), but I always
advise against it. The reason is that for well-written programs,
generally speaking, sharing the GPU will give you worse performance than
just running them sequentially.
Of course not all programs are well-written. But if it's a multi-user
cluster, it's hard to know what GPU resources a program will *eventually*
need. Even if it's not using all the GPU memory now, it could well use it
later on. So to avoid this happening, you'd have to have a resource like
gpu_ram for the jobs to announce in advance how much GPU memory they will
need -- and you'd have to have a mechanism available to kill non-compliant
jobs or warn if they go over. It's doable, of course, but it's a huge
pain. The problem is, implementing what you are talking about will bring
*other* problems - complexity in the build system, difficulty with
co-ordinating with NVidia packages and drivers.
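To spell out the accounting half of that: gpu_ram would just be an ordinary
consumable complex, something like the sketch below (names, hosts and numbers
invented), though nothing in it actually enforces the limit on the GPU itself:

```sh
# Add a line like this to the complex list via `qconf -mc`:
#   name     shortcut  type    relop  requestable  consumable  default  urgency
#   gpu_ram  g_ram     MEMORY  <=     YES          YES         0        0
# Then give each GPU host a capacity, and have jobs request it up front:
qconf -aattr exechost complex_values gpu_ram=16G node01
qsub -l gpu_ram=4G my_gpu_job.sh
```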
|
BTW, if you do an ls-remote of this repository, you'll see a lot of
branches and tags. One of them may be what you need. These were
originally imported from a repo of Dave Love. But I don't fully trust them.
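i.e. something like:

```sh
git ls-remote --heads --tags https://github.com/son-of-gridengine/sge
```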
|
Hmmm. There is a tag in there that might refer to 8.1.9, but it's not clear if that corresponds directly to the F28 packages. At least it's a starting point. |
I'm generally with you on the "just use the whole GPU" approach. But the machine learning team doesn't like to take "because I say so" as a reason, and sometimes they're quite justified in that. With some more poking around, it looks like we can use prolog/epilog to manage this data for the well-behaved programs that will account for the initial use cases. Hopefully we won't need to resort to a bunch of messing with the SGE internals. |
Do you just run serial GPU jobs? It seems to me that prolog and epilog
only run on the master node for the master process, not for the slave
processes of parallel jobs.
Best,
Feng
|
Good point, Feng. We run serial jobs, using one or more GPUs on a single
node. At some point we might get more ambitious, but I'll cross that
bridge if/when we get to it. |
I keep wondering if each GPU vendor's virtualisation stuff could be used
to deal with this sort of thing, e.g. NVIDIA GRID, but don't have such a
card to play with. I'm coming at it from the try-to-make-VirtualGL-secure
and a-whole-card-is-overkill-for-much-remote-visualisation angles.
Coincidentally enough, I was about to extend our GPU handling too (we use
a daemon to allocate them - and other things - via the starter method), so
that we properly log some performance stats against jobs.
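The logging bit will probably end up as an epilog fragment roughly like this (completely untested; the file our allocator daemon writes and the log path are invented, and a snapshot at job end is obviously cruder than sampling during the run):

```sh
#!/bin/sh
# Hypothetical epilog fragment: record a GPU snapshot against the SGE job id.
GPUS=$(cat "$TMPDIR/allocated_gpus" 2>/dev/null)   # written by our allocator daemon
[ -n "$GPUS" ] && nvidia-smi --id="$GPUS" \
    --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader \
    | sed "s/^/$JOB_ID,/" >> /var/log/sge-gpu-usage.csv
```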
Mark
|