DVC Evaluation #9851
willemkokke started this conversation in General
Hi @willemkokke! I'll take a shot at some initial thoughts on your questions:
Hope that helps!
I've spent some time evaluating DVC as part of a potential solution to a few practical problems we are having authoring datasets.
We deal mostly with 3D meshes and their various textures, and we have artists authoring and cleaning them. These artists are distributed all around the world, and giving them all access to machines in the office is not practical due to latency issues.
Allowing them to name files themselves, and/or requiring them to use command-line tools, is also pretty much a no-go.
So the overall idea is to develop a Qt interface which allows them to check out the part of the dataset they are meant to be working on to their local computer, make their changes and submit them back. This would allow us to have a perfect history of what changed, when, and by whom, as well as being able to ensure everything is named correctly and present.
Our machine learning people can then use this dataset using `dvc import` or `dvc import-url` in an unconstrained environment, as they are expected to know what they are doing in that regard.
DVC (+git) seems a very good starting point to base this on so far!
My first question hopefully is a simple one:
I've come across examples that set the `worktree` config property of a remote using `dvc remote modify`. It seems to be related to `version_aware` remotes, which I don't intend to use, but I'd still like to know what it's for. Neither the documentation nor the source code was of much help in figuring this out.
As I'm hoping to remove the need for the artists to understand or be aware of git/dvc, and because there is nothing really mergeable in the sub-projects anyway, I intend to make checkouts of a project exclusive, i.e. only one person at a time can have a project checked out for writing. This should make the workflow much more predictable and foolproof. I'm pretty sure this will have to be implemented on top of everything, as neither git nor dvc has support for this (for good reasons). If anybody has come across an implementation of this or has any thoughts around it, I'd love to hear them.
I intend to have two remotes: a default remote in the office and an S3 remote.
I need to make sure these remotes are kept in sync. For that I intend to set up a process in the office which, on every git commit, will pull the latest version of the git repo, then do a `dvc pull` from the office remote first, and from the S3 remote second if any files are still unresolved. Then, vice versa, it will do a `dvc push` to both remotes to ensure all files live on both. This should ensure the remotes are out of sync for as short a duration as possible.
Working inside the office, the GUI can do an additional `dvc pull` from the S3 remote if an artist is checking out a project that is not "in sync" yet. People outside the office will need to wait, as the office default remote is not available to them, so they cannot use the same trick.
Does this seem like a viable approach, or can you envision a better setup?
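The sync process described above could be sketched roughly as follows. The remote names `office` and `s3` are placeholders, `sync_remotes` is a hypothetical helper, and the sketch assumes that `dvc pull` exits with a non-zero status when some files could not be retrieved. The command runner is injectable so the sequence can be checked without git or dvc installed.

```python
import subprocess
from typing import Callable, List, Sequence

Runner = Callable[[Sequence[str]], int]


def run_command(cmd: Sequence[str]) -> int:
    """Run a command in the current repo and return its exit code."""
    return subprocess.run(list(cmd), check=False).returncode


def sync_remotes(
    primary: str = "office",
    secondary: str = "s3",
    run: Runner = run_command,
) -> List[List[str]]:
    """Pull from the primary remote, fall back to the secondary, push to both."""
    executed: List[List[str]] = []

    def step(cmd: List[str]) -> int:
        executed.append(cmd)
        return run(cmd)

    if step(["git", "pull"]) != 0:
        raise RuntimeError("git pull failed")

    # Try the office remote first; only consult S3 if something is missing.
    if step(["dvc", "pull", "-r", primary]) != 0:
        if step(["dvc", "pull", "-r", secondary]) != 0:
            raise RuntimeError("files missing from both remotes")

    # Push everything to both remotes so each ends up with a full copy.
    for remote in (primary, secondary):
        if step(["dvc", "push", "-r", remote]) != 0:
            raise RuntimeError(f"dvc push to {remote} failed")

    return executed
```

A hook or CI job triggered on each commit would call `sync_remotes()` from inside the repository; the returned command list is only there to make the sequence easy to log or test.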
I think I've read everything there is about the current state of using different hash functions for the content-addressable storage. I agree with the sentiment that MD5 is far from the best choice if you are at all worried about having bad actors on the inside. I also appreciate why things are the way they are, and that any changes need to be approached in a pragmatic and careful way. Personally I'd like to use BLAKE3, as it seems like the best trade-off between security and speed for this use case. As I would like to use this from the start of the project, and not have to worry about migrating two 9TB copies of our dataset, I'd be more than happy to contribute to this if somebody would be willing to spend 15 minutes talking to me to make sure I've got the right end of the stick. (There seem to have been a few attempts at this; one of them seems pretty far along, but it's not clear to me what the exact state is.)
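To make the hash-function question concrete, here is a small illustration of content-addressable storage with a pluggable hash. It is not DVC's actual cache code: the `object_path` helper and the `<algorithm>/<2-char prefix>/<rest>` layout are assumptions chosen for illustration, and `blake2b` from Python's standard library stands in for BLAKE3, which `hashlib` does not provide.

```python
import hashlib
from pathlib import PurePosixPath


def object_path(data: bytes, algorithm: str = "md5") -> PurePosixPath:
    """Derive a storage path for `data` from its content hash."""
    digest = hashlib.new(algorithm, data).hexdigest()
    # Illustrative layout only: namespace by algorithm, then shard on the
    # first two hex characters of the digest.
    return PurePosixPath(algorithm) / digest[:2] / digest[2:]


print(object_path(b"hello"))             # md5/5d/41402abc4b2a76b9719d911017c592
print(object_path(b"hello", "blake2b"))  # same content, different address
```

Because the address is derived from the digest, switching hash functions changes the address of every stored object, which is why migrating an existing multi-terabyte store is the expensive part.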
This one is for the medium-term future: I can see a slightly more generalised version of this GUI being useful in many other places, especially if DVC works well together with git's sparse checkouts AND sparse index. For instance, in my line of work, we could have a monorepo containing all the large Unreal / Unity / Maya / ZBrush projects. My current assessment is that this mostly depends on support from some of the dependencies pulled in by scmrepo, like pygit2 and dulwich. Neither of them fully supports both. Are there other areas inside DVC that anyone knows of that would need modification to support this? I'm currently resisting the temptation to contribute to dulwich to improve this situation, but I really should do more research to make sure this is time well spent ;)
I know this is a lot (I only came to ask about the worktree thing but I got carried away!). Feel free to ignore any or all of this; I'm not trying to get anyone to do my work for me ;)