Revive this? #8
Comments
@njoly @bodgerer I'm wondering how much you guys know about GridEngine internals, or if you know anyone else who might? Dave Love seems to have disappeared again, and I'm wondering who else might have a deep knowledge of GridEngine and be willing to maintain a GitHub-based version of the project? |
It should be maintained. I am interested, but how much I can contribute depends on how much of the work I would have to solve myself. Anyway, I have some improvement ideas too; maybe we can do it together and write some articles for conference proceedings and journal issues.
|
Hello, |
It's not really about the specific SCM; it's about who will pick up the torch when Dave stops maintaining it -- which may have already happened. |
I'd be interested in contributing, though my understanding of
GridEngine internals is kind of shallow at this point. Mainly I know a
few very tiny spots where I fixed some compile failures when trying to
rebuild from source last spring, before Dave Love made his Fedora 28
packages available.
Chris
|
Hi all, I've worked on a few of the subsystems within gridengine, fixing a few horrible problems, and am happy to share what I've learned and help out if I can. Regarding maintaining a GitHub-based fork, I'm not sure my day job would give me time (SGE admin since version 5.3; currently have 3 SoGE clusters / 750 nodes). I might be seeing Dave later this month, but he did write on the list that he'd be happy for someone to take over maintenance - as long as he gets to properly hand things over. He's done a great job with SoGE over the past years. Mark |
It would be cool if you see Dave and talk to him about it. My concern is
there may not be many people who know the internals well enough.
Personally I don't think I have time to be the main organizer.
|
I'm not sure the problem is that not many people know the internals well enough: it's that there is a general lack of people interested in reading and editing the source code at all. I'm not sure how to fix that. |
Some years ago I read the code (to find a memory leak problem in masterd), and really there are many things that I don't understand. If I don't understand the code around what I'm modifying, I'm not sure I would be more restorative than destructive. |
I guess it needs better documentation, at the very least. I'll see if I can find time to start some. |
What level of documentation were you talking about here-- code-level vs.
repo-level (e.g. things like installation)?
BTW I remember we heard from Dave at some point that there were some things
in this repo (we cloned it from his master IIRC) that weren't intended for
release. So at some point it might make sense to look at that. But I
don't really have time to drive this forward properly.
|
I was thinking code-level, e.g. a getting started guide for prospective developers |
Mark-- sorry I didn't respond earlier... I was thinking about this. I
think what you propose is very useful but we need to make sure we start
from the right starting point, e.g. the right version of the code, and
don't end up starting something that's not going to sustain itself. Did
you end up seeing Dave?
|
Hi Dan, Sorry, forgot to reply. Yes, I saw Dave - but unfortunately we didn't have time to discuss gridengine. Cheers, Mark |
OK, it looks like I'm going to get into maintenance in a serious way soon. I've got requests to get SGE up and running on MacOS X (yikes!), and to add GPU memory allocation support (looks like we can build on gpu_loadsensor.c for that). Currently we're running Dave Love's Fedora 28 build on our systems, so it would be best to build on that. Unfortunately, I can't seem to download the corresponding SRPMs - has anyone figured out which commit(s)/tag(s) in this tree or one of Dave's correspond to that build? Thanks, |
That sounds great!
I'm sorry, I don't know that. Honestly, I don't even know what an SRPM is.
Regarding GPU memory allocation support: my feeling is that may be a little
bit ambitious -- i.e. anything that requires integrating GridEngine code
with NVidia code. GridEngine is already extremely complicated, and its
build process is nontrivial. Worrying about different NVidia library versions
seems like potentially a step too far, but that's just my opinion. Perhaps
it might be easier to support allocating particular GPUs (e.g. by
exporting the CUDA_VISIBLE_DEVICES variable). I think this is what you can
do with Univa GridEngine. Might be nice to maintain compatibility with
that.
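Roughly the kind of thing I mean -- a very rough, untested sketch, where the lock directory and the one-GPU-per-job assumption are invented for illustration (not how Univa actually does it):

```sh
#!/bin/sh
# Hypothetical starter_method: claim one free GPU with an atomic mkdir lock
# and expose only that GPU to the job via CUDA_VISIBLE_DEVICES, then exec the
# real job command. A matching epilog would have to remove the lock again.
LOCKDIR=/var/run/sge-gpus                       # invented path
mkdir -p "$LOCKDIR"
for id in $(nvidia-smi --query-gpu=index --format=csv,noheader); do
    if mkdir "$LOCKDIR/$id" 2>/dev/null; then   # atomic: first claimant wins
        export CUDA_VISIBLE_DEVICES="$id"
        exec "$@"                               # SGE passes the job command as arguments
    fi
done
echo "no free GPU on $(hostname)" >&2
exit 1
```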
Dan
|
No prob -- in an RPM-based system like Fedora, the SRPM is shorthand for the
source package corresponding to a given binary package (RPM). In theory you
can rebuild the binaries directly from that source package. Being able to
do that would help ensure that the MacOS port was fully in sync with the
Fedora implementation.
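If we can ever locate the SRPM, the rebuild itself is the easy part -- roughly this on Fedora, where the package name "gridengine" is a guess on my part:

```sh
# Rough sketch of a Fedora SRPM rebuild (needs dnf-plugins-core for `download`).
dnf download --source gridengine          # or fetch the .src.rpm by hand
sudo dnf builddep gridengine-*.src.rpm    # install the build dependencies
rpmbuild --rebuild gridengine-*.src.rpm   # binary RPMs land under ~/rpmbuild/RPMS/
```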
We're already using prolog/epilog scripts to manage GPUs on a per-GPU basis
as you describe. But many use cases don't require all the resources on a
GPU, and users are grumbling that they'd like to be able to share GPUs if
possible. This could be done by extending the prolog/epilog, but I'd
like to reduce the amount of manual intervention, and thought perhaps the
load sensor would help with that.
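What I was picturing is a load sensor that at least reports free GPU memory per host -- a sketch along these lines, where the gpu_free_mem complex name is made up and would have to be defined first, and nothing is enforced:

```sh
#!/bin/sh
# Hypothetical load sensor: speaks SGE's begin/end protocol on stdin/stdout
# and reports total free GPU memory (MB) on this host as gpu_free_mem.
HOST=$(hostname)
while read -r line; do
    [ "$line" = "quit" ] && exit 0
    FREE=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits \
           | awk '{s += $1} END {print s + 0}')
    echo "begin"
    echo "$HOST:gpu_free_mem:$FREE"
    echo "end"
done
```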
Chris
|
OK, thanks.
I do a lot of GPU stuff myself (as part of the Kaldi project). People
often do ask about whether they can do that (share GPUs), but I always
advise against it. The reason is that for well-written programs,
generally speaking, sharing the GPU will give you worse performance than
just running them sequentially.
Of course not all programs are well-written. But if it's a multi-user
cluster, it's hard to know what GPU resources a program will *eventually*
need. Even if it's not using all the GPU memory now, it could well use it
later on. So to avoid this happening, you'd have to have a resource like
gpu_ram for the jobs to announce in advance how much GPU memory they will
need -- and you'd have to have a mechanism available to kill non-compliant
jobs or warn if they go over. It's doable, of course, but it's a huge
pain. The problem is, implementing what you are talking about will bring
*other* problems - complexity in the build system, difficulty with
co-ordinating with NVidia packages and drivers.
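To spell out the accounting half of that: gpu_ram would just be an ordinary
consumable complex, something like the sketch below (names, hosts and numbers
invented), though nothing in it actually enforces the limit on the GPU itself:

```sh
# Add a line like this to the complex list via `qconf -mc`:
#   name     shortcut  type    relop  requestable  consumable  default  urgency
#   gpu_ram  g_ram     MEMORY  <=     YES          YES         0        0
# Then give each GPU host a capacity, and have jobs request it up front:
qconf -aattr exechost complex_values gpu_ram=16G node01
qsub -l gpu_ram=4G my_gpu_job.sh
```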
|
BTW, if you do an ls-remote of this repository, you'll see a lot of
branches and tags. One of them may be what you need. These were
originally imported from a repo of Dave Love. But I don't fully trust them.
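i.e. something like:

```sh
git ls-remote --heads --tags https://github.com/son-of-gridengine/sge
```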
|
Hmmm. There is a tag in there that might refer to 8.1.9, but it's not clear if that corresponds directly to the F28 packages. At least it's a starting point. |
I'm generally with you on the "just use the whole GPU" approach. But the machine learning team doesn't like to take "because I say so" as a reason, and sometimes they're quite justified in that. With some more poking around, it looks like we can use prolog/epilog to manage this data for the well-behaved programs that will account for the initial use cases. Hopefully we won't need to resort to a bunch of messing with the SGE internals. |
Do you just run serial GPU jobs? It seems to me that prolog and epilog
only run on the master node for the master process, not for the slave
processes of parallel jobs.
Best,
Feng
|
Good point, Feng. We run serial jobs, using one or more GPUs on a single
node. At some point we might get more ambitious, but I'll cross that
bridge if/when we get to it. |
I keep wondering if each GPU vendor's virtualisation stuff could be used
to deal with this sort of thing, e.g. NVIDIA GRID, but don't have such a
card to play with. I'm coming at it from the try-to-make-VirtualGL-secure
and a-whole-card-is-overkill-for-much-remote-visualisation angles.
Coincidentally enough, I was about to extend our GPU handling too (we use
a daemon to allocate them - and other things - via the starter method), so
that we properly log some performance stats against jobs.
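The logging bit will probably end up as an epilog fragment roughly like this (completely untested; the file our allocator daemon writes and the log path are invented, and a snapshot at job end is obviously cruder than sampling during the run):

```sh
#!/bin/sh
# Hypothetical epilog fragment: record a GPU snapshot against the SGE job id.
GPUS=$(cat "$TMPDIR/allocated_gpus" 2>/dev/null)   # written by our allocator daemon
[ -n "$GPUS" ] && nvidia-smi --id="$GPUS" \
    --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader \
    | sed "s/^/$JOB_ID,/" >> /var/log/sge-gpu-usage.csv
```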
Mark
|