Error : Check failed: error == cudaSuccess (30 vs. 0) unknown error #120

Open
rahulbhalerao001 opened this issue Apr 4, 2016 · 4 comments

@rahulbhalerao001
Contributor

I created a private AMI from a working setup (after the cache changes), and the ImageNet example ran correctly on that AMI.

However, today I created a new cluster from this AMI and got the error "Check failed: error == cudaSuccess (30 vs. 0) unknown error".

@robertnishihara
Member

Did you get the error when running ImageNetApp.scala? Was that the only difference? What sort of nodes were you using, and what OS? Were you using the current SparkNet master, or did you modify anything?

@rahulbhalerao001
Contributor Author

Yes, ImageNetApp.scala. After posting, I pulled, rebuilt, and re-ran, with the same result.
Node: g2.8xlarge with Ubuntu 14.04 (same as the AMI provided here).
No, I did not modify anything.

@rahulbhalerao001
Contributor Author

I made a private AMI because I wanted to stick to one version of the code.

@robertnishihara
Member

I'm not sure exactly what the problem is, but one starting point is to figure out exactly where the error is occurring. A couple of ways to do this:

  1. Start a Spark shell with something like ~/spark/bin/spark-shell --jars /root/SparkNet/target/scala-2.10/sparknet-assembly-0.1-SNAPSHOT.jar and then try loading a model from a model file, creating a net, and calling net.forward, to see precisely where it crashes. You can do all of this on the Spark master; you don't need to create a net on each worker.
  2. Sometimes running things in the Spark shell behaves differently from running a script, so I'd also suggest commenting out components of CifarApp.scala until the error goes away, to find the minimal example that causes it to fail.
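
To make either approach easier, here is a rough sketch of a helper you could paste into the Spark shell. It runs named steps one at a time and prints each name before running it. The object name FirstFailure, the method firstFailing, and the step names are all made up for illustration, and the step bodies are placeholders, not real SparkNet calls; you would substitute whatever you actually run (loading the model, creating the net, calling net.forward). One caveat: a failed native check may abort the whole JVM rather than throw an exception, in which case nothing is caught and the last "running:" line printed is what tells you where it died.

```scala
import scala.util.{Failure, Try}

object FirstFailure {
  // Runs the steps in order and returns the name of the first one that throws.
  // Each step name is printed and flushed before the step runs, so the culprit
  // is still identifiable if the process dies instead of throwing.
  def firstFailing(steps: Seq[(String, () => Unit)]): Option[String] =
    steps.iterator.map { case (name, run) =>
      println(s"running: $name")
      Console.out.flush()
      name -> Try(run())
    }.collectFirst { case (name, Failure(e)) =>
      println(s"failed at: $name (${e.getMessage})")
      name
    }
}

// Hypothetical steps; replace each placeholder body with the real call.
val steps: Seq[(String, () => Unit)] = Seq(
  "load model file" -> (() => ()),                                        // placeholder
  "create net"      -> (() => throw new RuntimeException("simulated")),   // placeholder failure
  "net.forward"     -> (() => ())                                         // placeholder
)

val culprit = FirstFailure.firstFailing(steps)
```

Because the iterator is lazy, steps after the first failure are not run, which keeps the output focused on the step that broke.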

By the way, you said the problem was with ImageNetApp, but does it also occur with CifarApp?
