Running container using nvidia-docker2 #84

Open
kekeblom opened this issue Apr 22, 2020 · 2 comments
kekeblom commented Apr 22, 2020

I seem to have the same issue as in #74.

That is, when I run run_alignment_tool in the mounted data directory (using the provided sample data), I get: libGL error: No matching fbConfigs or visuals found. The Director GUI pops up, but it is unable to open an OpenGL context.

I get the exact same error message when I run the glxgears test program from the mesa-utils package.
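For anyone reproducing this, the check can be scripted. The following is a minimal sketch, not part of LabelFusion: the function name gl_status and its three result strings are made up for illustration. It only assumes that a failing GLX program prints the error quoted above and that a working glxinfo prints an "OpenGL vendor string" line.

```shell
#!/bin/sh
# Classify the combined stdout/stderr of a GLX test program read from stdin.
# gl_status is a hypothetical helper name, not an existing tool.
gl_status() (
    log=$(cat)
    if printf '%s\n' "$log" | grep -q 'No matching fbConfigs or visuals found'; then
        # The exact libGL error reported in this issue.
        echo "glx-broken"
    elif printf '%s\n' "$log" | grep -q 'OpenGL vendor string:'; then
        # glxinfo reached a working OpenGL context and printed its vendor.
        echo "glx-ok"
    else
        # Neither marker seen (tool missing, no X display, other failure).
        echo "glx-unknown"
    fi
)
```

Inside the container, with mesa-utils installed and an X display reachable, it could be used as `glxinfo 2>&1 | gl_status`.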

My understanding is that the OpenGL libraries inside the container either do not match those running on the host or fail to load.

I'm running nvidia-docker2 and my Docker version is 19.03.8, build afacb8b7f0. I'm running Nvidia driver version 440.64.00 on the host machine.

It seems Nvidia does not officially support GLX on nvidia-docker. However, they do provide cudagl images at https://hub.docker.com/r/nvidia/cudagl. I'm not sure which part of that image's Dockerfile is key, but with that image I am able to run glxgears, i.e. OpenGL runs fine.

I could rebuild the container on top of that image. I tried, but the container no longer builds: there is an error about vtk not being the right version. I can get around this by setting -DUSE_SYSTEM_VTK:BOOL=OFF and -DUSE_PRECOMPILED_VTK=ON in the compile_all.sh script, but the install then fails for another reason that I didn't fully investigate. Other issues might also come up, as the CUDA version would be bumped to 9 and some system packages might be updated.

There is probably just some minor glitch on my system, which is why I'm opening this issue. A comment on the issue I referenced says "you should pull the nvidia-docker2 image, not the nvidia-docker1." and this seems to have resolved that person's issue. However, I don't know exactly what that means: I'm running nvidia-docker2 and pulling the latest image from Docker Hub using nvidia-docker pull robotlocomotion/labelfusion.

Does anyone know what might be wrong here?

@kekeblom changed the title from "Running container with using nvidia-docker2" to "Running container using nvidia-docker2" on Apr 22, 2020
@kekeblom

You'll all be happy to hear that I was able to solve this issue. What seems to be happening is that the OpenGL libraries inside the container were not compatible with what is running on my system. I tried both Nvidia driver 390 and 440, with no luck. I'm not sure what the actual cause is; it may also have something to do with how the X server is configured on the host.

What resolved the issue was installing libglvnd, which is designed as a vendor-neutral dispatch layer between applications and the vendor graphics libraries. It supports GLX, which is used by LabelFusion. I derived a new image from the labelfusion image and installed those libraries the same way the official Nvidia cudagl images do. All credit to them. See the end of the message for the exact Dockerfile I used.

It seems the current setup depends heavily on how the host machine is set up. I would create a formal pull request updating the image, but unfortunately I was unable to build the original image. Quite a few libraries seem to have been updated and some of the dependencies no longer build.
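To see whether the dispatch layer is actually visible to the dynamic linker inside a container, one can inspect the linker cache. This is a hedged sketch: glvnd_report is a hypothetical helper name, and the three sonames it looks for (libGLdispatch.so.0 and libGLX.so.0 from libglvnd, libGLX_nvidia.so.0 as the NVIDIA vendor library) are an assumption about what a working setup exposes.

```shell
#!/bin/sh
# Read `ldconfig -p` output on stdin and report, for each library of
# interest, whether it appears in the dynamic linker cache.
# glvnd_report is a hypothetical helper name used for illustration.
glvnd_report() (
    cache=$(cat)
    for lib in libGLdispatch.so.0 libGLX.so.0 libGLX_nvidia.so.0; do
        # -F matches the soname literally (dots are not regex wildcards).
        if printf '%s\n' "$cache" | grep -q -F "$lib"; then
            echo "$lib: found"
        else
            echo "$lib: missing"
        fi
    done
)
```

Inside the container this would be run as `ldconfig -p | glvnd_report`. A "missing" line for the vendor library would suggest the NVIDIA runtime did not inject the driver's graphics libraries, while missing libglvnd libraries would point at the image itself.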

It could be worth looking into updating e.g. Director to use a newer official version to reduce the risk of this software being left behind permanently. Would anyone more familiar with these projects be able to estimate how big an undertaking that would be? What would be the main issues?

Here is the Dockerfile I used to build my image.

FROM robotlocomotion/labelfusion:latest

RUN apt-get update && apt-get install -y --no-install-recommends \
    ca-certificates apt-transport-https gnupg-curl && \
    NVIDIA_GPGKEY_SUM=d1be581509378368edeec8c1eb2958702feedf3bc3d17011adbf24efacce4ab5 && \
    NVIDIA_GPGKEY_FPR=ae09fe4bbd223a84b2ccfce3f60f4b3d7fa2af80 && \
    apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub && \
    apt-key adv --export --no-emit-version -a $NVIDIA_GPGKEY_FPR | tail -n +5 > cudasign.pub && \
    echo "$NVIDIA_GPGKEY_SUM  cudasign.pub" | sha256sum -c --strict - && rm cudasign.pub && \
    echo "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64 /" > /etc/apt/sources.list.d/cuda.list && \
    echo "deb https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64 /" > /etc/apt/sources.list.d/nvidia-ml.list && \
    apt-get purge --auto-remove -y gnupg-curl && \
    rm -rf /var/lib/apt/lists/*

### OpenGL

RUN apt-get update && apt-get install -y --no-install-recommends \
        git \
        ca-certificates \
        make \
        automake \
        autoconf \
        libtool \
        pkg-config \
        python \
        libxext-dev \
        libx11-dev \
        x11proto-gl-dev && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /opt/libglvnd
RUN git clone --branch="v1.1.0" https://github.com/NVIDIA/libglvnd.git . && \
    ./autogen.sh && \
    ./configure --prefix=/usr/local --libdir=/usr/local/lib/x86_64-linux-gnu && \
    make -j"$(nproc)" install-strip && \
    find /usr/local/lib/x86_64-linux-gnu -type f -name 'lib*.la' -delete

RUN dpkg --add-architecture i386 && \
    apt-get update && apt-get install -y --no-install-recommends \
        gcc-multilib \
        libxext-dev:i386 \
        libx11-dev:i386 && \
    rm -rf /var/lib/apt/lists/*

# 32-bit libraries
RUN make distclean && \
    ./autogen.sh && \
    ./configure --prefix=/usr/local --libdir=/usr/local/lib/i386-linux-gnu --host=i386-linux-gnu "CFLAGS=-m32" "CXXFLAGS=-m32" "LDFLAGS=-m32" && \
    make -j"$(nproc)" install-strip && \
    find /usr/local/lib/i386-linux-gnu -type f -name 'lib*.la' -delete

COPY 10_nvidia.json /usr/local/share/glvnd/egl_vendor.d/10_nvidia.json

RUN echo '/usr/local/lib/x86_64-linux-gnu' >> /etc/ld.so.conf.d/glvnd.conf && \
    echo '/usr/local/lib/i386-linux-gnu' >> /etc/ld.so.conf.d/glvnd.conf && \
    ldconfig

ENV LD_LIBRARY_PATH /usr/local/lib/x86_64-linux-gnu:/usr/local/lib/i386-linux-gnu${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

WORKDIR /root

ENTRYPOINT bash -c "source /root/labelfusion/docker/docker_startup.sh && /bin/bash"
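
Building and starting an image from this Dockerfile could look roughly like the sketch below. Several things here are assumptions rather than anything stated in the thread: the tag labelfusion-glvnd and the helper names are made up, NVIDIA_DRIVER_CAPABILITIES=all may or may not be needed depending on what the base image already sets, and the data directory mount used by LabelFusion's own run script is omitted. The script prints the commands by default so they can be reviewed first.

```shell
#!/bin/sh
IMAGE="labelfusion-glvnd"     # hypothetical tag for the derived image
DRY_RUN="${DRY_RUN:-1}"       # 1 = only print commands, 0 = execute them

# Print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

build_and_start() {
    # Run from the directory holding the Dockerfile (and 10_nvidia.json,
    # which the COPY instruction expects in the build context).
    run docker build -t "$IMAGE" .
    # Allow local root (the container user) to talk to the host X server.
    run xhost +local:root
    # --runtime=nvidia is the nvidia-docker2 way of selecting the GPU runtime.
    run docker run --rm -it --runtime=nvidia \
        -e DISPLAY \
        -e NVIDIA_DRIVER_CAPABILITIES=all \
        -v /tmp/.X11-unix:/tmp/.X11-unix:rw \
        "$IMAGE"
}
```

Calling `build_and_start` prints the three commands; setting `DRY_RUN=0` before defining or sourcing the script executes them instead.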

@iamlucaswolf

Thanks, @kekeblom, this resolved the issue for me! 👍🏻

Note that this expects 10_nvidia.json to be present in the Docker build context. To avoid that dependency, I replaced the last COPY instruction with the one below, which should be a little more robust:

COPY --from=nvidia/opengl:1.0-glvnd-runtime-ubuntu16.04 \
  /usr/local/share/glvnd/egl_vendor.d/10_nvidia.json \
  /usr/local/share/glvnd/egl_vendor.d/10_nvidia.json
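
For setups that keep the original COPY instruction instead, the file has to be created in the build context by hand. The sketch below writes what is, to the best of this editor's knowledge, the usual content of NVIDIA's EGL vendor file; treat the JSON as an assumption and compare it against the copy inside the nvidia/opengl image before relying on it. The function name write_egl_vendor_json is hypothetical.

```shell
#!/bin/sh
# Write an EGL vendor (ICD) file that points libglvnd at the NVIDIA EGL
# library. The output path defaults to 10_nvidia.json in the current
# directory, i.e. the Docker build context.
write_egl_vendor_json() {
    cat > "${1:-10_nvidia.json}" <<'EOF'
{
    "file_format_version" : "1.0.0",
    "ICD" : {
        "library_path" : "libEGL_nvidia.so.0"
    }
}
EOF
}
```

Running `write_egl_vendor_json` next to the Dockerfile creates the file the COPY instruction looks for. Copying it from the NVIDIA image, as above, avoids having to maintain this content by hand.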
