Remote IO: S3 support #479

Merged
merged 27 commits into from
Oct 22, 2024

Conversation

madsbk
Member

@madsbk madsbk commented Oct 2, 2024

Implements AWS S3 read support using libcurl:

import kvikio
import cupy

with kvikio.RemoteFile.from_s3_url("s3://my-bucket/my-file") as f:
    ary = cupy.empty(f.nbytes, dtype="uint8")
    f.read(ary)

Supersedes #426

@madsbk madsbk added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Oct 2, 2024
madsbk added a commit to madsbk/cudf that referenced this pull request Oct 2, 2024
madsbk added a commit to madsbk/cudf that referenced this pull request Oct 2, 2024
@madsbk madsbk force-pushed the remote-io-s3 branch 2 times, most recently from 9287599 to f82173e Compare October 7, 2024 08:09
@madsbk madsbk force-pushed the remote-io-s3 branch 2 times, most recently from a2308e2 to ad92a7d Compare October 9, 2024 12:34
@madsbk madsbk marked this pull request as ready for review October 9, 2024 13:56
@madsbk madsbk requested review from a team as code owners October 9, 2024 13:56
@madsbk madsbk requested a review from AyodeAwe October 9, 2024 13:56
Contributor

@bdice bdice left a comment

Very nice work. I don't have any suggestions from a quick look. It'd be good to have a second pair of eyes, possibly from @wence-?

@madsbk
Member Author

madsbk commented Oct 10, 2024

> Very nice work. I don't have any suggestions from a quick look. It'd be good to have a second pair of eyes, possibly from @wence-?

Thanks @bdice.
Agreed. @wence-, if you can find time, it would be good to get a review.

Contributor

@vyasr vyasr left a comment

Sorry I didn't get around to reviewing #464. I took this opportunity to familiarize myself with that code and then reviewed here.

python/kvikio/kvikio/_lib/remote_handle.pyx
@@ -67,6 +76,59 @@ cdef class RemoteFile:
        ret._handle = make_unique[cpp_RemoteHandle](move(ep), n)
        return ret

    @classmethod
Contributor

There is a lot of logic repeated in these three functions. It would benefit from a single helper that contains most of the logic and takes the created endpoint as an argument, since the creation of ep is really the only difference.

Member Author

I tried, but I cannot find a way to implement a from_unique_ptr factory function without it getting very complicated.
Basically, I want something like this:

    @staticmethod
    cdef RemoteFile from_unique_ptr(
        unique_ptr[cpp_RemoteHandle] handle,
        nbytes: Optional[int]
    ):
        cdef RemoteFile ret = RemoteFile.__new__(RemoteFile)
        if nbytes is None:
            ret._handle = make_unique[cpp_RemoteHandle](move(handle))
            return ret
        ret._handle = make_unique[cpp_RemoteHandle](move(handle), <size_t> nbytes)
        return ret

But I cannot find a nice way to call from_unique_ptr() with an instance of a derived class, such as unique_ptr[cpp_HttpEndpoint]. Any suggestions?

Contributor

Perhaps like this. I am not sure it is much cleaner:

def _set_handle(self, unique_ptr[cpp_RemoteEndpoint] ep, nbytes):
    if nbytes is None:
        self._handle = make_unique[cpp_RemoteHandle](move(ep))
    else:
        self._handle = make_unique[cpp_RemoteHandle](move(ep), <size_t>nbytes)

def open_http(...):
    cdef RemoteFile ret = RemoteFile()
    cdef unique_ptr[cpp_HttpEndpoint] ep = make_unique[cpp_HttpEndpoint](...)
    ret._set_handle(<unique_ptr[cpp_RemoteEndpoint]>move(ep));
    return ret;

Contributor

The fundamental problem is that Cython does not natively understand smart pointer polymorphism in the same way that it understands raw pointer polymorphism, so you cannot pass a unique_ptr to a child class where it expects you to pass a unique_ptr to a base class or vice versa. It's a clear sign of the C rather than C++ roots of the project. Here's a patch that builds for me locally and is a bit cleaner IMHO than what Lawrence proposed, with the caveat that you temporarily have a raw pointer that you construct a unique_ptr from rather than using make_unique. Up to you two if you like it or not.

diff --git a/python/kvikio/kvikio/_lib/remote_handle.pyx b/python/kvikio/kvikio/_lib/remote_handle.pyx
index 1156300..de48bb9 100644
--- a/python/kvikio/kvikio/_lib/remote_handle.pyx
+++ b/python/kvikio/kvikio/_lib/remote_handle.pyx
@@ -20,13 +20,11 @@ cdef extern from "<kvikio/remote_handle.hpp>" nogil:
     cdef cppclass cpp_RemoteEndpoint "kvikio::RemoteEndpoint":
         pass
 
-    cdef cppclass cpp_HttpEndpoint "kvikio::HttpEndpoint":
+    cdef cppclass cpp_HttpEndpoint "kvikio::HttpEndpoint"(cpp_RemoteEndpoint):
         cpp_HttpEndpoint(string url) except +
 
-    cdef cppclass cpp_S3Endpoint "kvikio::S3Endpoint":
+    cdef cppclass cpp_S3Endpoint "kvikio::S3Endpoint"(cpp_RemoteEndpoint):
         cpp_S3Endpoint(string url) except +
-
-    cdef cppclass cpp_S3Endpoint "kvikio::S3Endpoint":
         cpp_S3Endpoint(string bucket_name, string object_name) except +
 
     pair[string, string] cpp_parse_s3_url \
@@ -56,6 +54,19 @@ cdef string _to_string(str_or_none):
     return str.encode(str(str_or_none))
 
 
+cdef RemoteFile make_remotefile(
+    cpp_RemoteEndpoint* ep,
+    nbytes: Optional[int],
+):
+    cdef RemoteFile ret = RemoteFile()
+    if nbytes is None:
+        ret._handle = make_unique[cpp_RemoteHandle](unique_ptr[cpp_RemoteEndpoint](ep))
+        return ret
+    cdef size_t n = nbytes
+    ret._handle = make_unique[cpp_RemoteHandle](unique_ptr[cpp_RemoteEndpoint](ep), n)
+    return ret
+
+
 cdef class RemoteFile:
     cdef unique_ptr[cpp_RemoteHandle] _handle
 
@@ -65,16 +76,10 @@ cdef class RemoteFile:
         url: str,
         nbytes: Optional[int],
     ):
-        cdef RemoteFile ret = RemoteFile()
-        cdef unique_ptr[cpp_HttpEndpoint] ep = make_unique[cpp_HttpEndpoint](
-            _to_string(url)
+        return make_remotefile(
+            new cpp_HttpEndpoint(_to_string(url)),
+            nbytes,
         )
-        if nbytes is None:
-            ret._handle = make_unique[cpp_RemoteHandle](move(ep))
-            return ret
-        cdef size_t n = nbytes
-        ret._handle = make_unique[cpp_RemoteHandle](move(ep), n)
-        return ret
 
     @classmethod
     def open_s3(
@@ -83,16 +88,10 @@ cdef class RemoteFile:
         object_name: str,
         nbytes: Optional[int],
     ):
-        cdef RemoteFile ret = RemoteFile()
-        cdef unique_ptr[cpp_S3Endpoint] ep = make_unique[cpp_S3Endpoint](
-            _to_string(bucket_name), _to_string(object_name)
+        return make_remotefile(
+            new cpp_S3Endpoint(_to_string(bucket_name), _to_string(object_name)),
+            nbytes,
         )
-        if nbytes is None:
-            ret._handle = make_unique[cpp_RemoteHandle](move(ep))
-            return ret
-        cdef size_t n = nbytes
-        ret._handle = make_unique[cpp_RemoteHandle](move(ep), n)
-        return ret
 
     @classmethod
     def open_s3_from_http_url(
@@ -100,16 +99,10 @@ cdef class RemoteFile:
         url: str,
         nbytes: Optional[int],
     ):
-        cdef RemoteFile ret = RemoteFile()
-        cdef unique_ptr[cpp_S3Endpoint] ep = make_unique[cpp_S3Endpoint](
-            _to_string(url)
+        return make_remotefile(
+            new cpp_S3Endpoint(_to_string(url)),
+            nbytes,
         )
-        if nbytes is None:
-            ret._handle = make_unique[cpp_RemoteHandle](move(ep))
-            return ret
-        cdef size_t n = nbytes
-        ret._handle = make_unique[cpp_RemoteHandle](move(ep), n)
-        return ret
 
     @classmethod
     def open_s3_from_s3_url(
@@ -118,16 +111,10 @@ cdef class RemoteFile:
         nbytes: Optional[int],
     ):
         cdef pair[string, string] bucket_and_object = cpp_parse_s3_url(_to_string(url))
-        cdef RemoteFile ret = RemoteFile()
-        cdef unique_ptr[cpp_S3Endpoint] ep = make_unique[cpp_S3Endpoint](
-            bucket_and_object.first, bucket_and_object.second
+        return make_remotefile(
+            new cpp_S3Endpoint(bucket_and_object.first, bucket_and_object.second),
+            nbytes,
         )
-        if nbytes is None:
-            ret._handle = make_unique[cpp_RemoteHandle](move(ep))
-            return ret
-        cdef size_t n = nbytes
-        ret._handle = make_unique[cpp_RemoteHandle](move(ep), n)
-        return ret
 
     def nbytes(self) -> int:
         return deref(self._handle).nbytes()

Member

Could this just be a templated C++ factory function that we extern to Cython and call here?

Contributor

You could do part of it in C++ but RemoteFile is a Python class so you'd still need a wrapper around the C++ function to handle instantiation and assignment to attributes of that object.

Member

I think that is fine. Basically, we would be saying: here is a factory function that takes some arguments and returns a std::unique_ptr<cpp_RemoteHandle>, which we just move into ret._handle.

Contributor

Yup that would be fine. Probably just use inline C++ to define that in this file itself.
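For readers following the thread, here is a minimal C++ sketch of the factory idea discussed above. It is not the code that was merged: `EndpointSketch`, `HttpEndpointSketch`, `HandleSketch` and `make_handle` are hypothetical stand-ins for kvikio's endpoint and handle classes, chosen only to show how a templated factory can build the derived endpoint, convert it to the base type, and return a ready-made handle.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for an abstract endpoint base class.
struct EndpointSketch {
  virtual ~EndpointSketch() = default;
  virtual std::string str() const = 0;
};

// Hypothetical stand-in for one concrete endpoint type.
struct HttpEndpointSketch : EndpointSketch {
  explicit HttpEndpointSketch(std::string url) : url_(std::move(url)) {}
  std::string str() const override { return url_; }
  std::string url_;
};

// Hypothetical stand-in for the handle: owns an endpoint and, optionally, a known size.
struct HandleSketch {
  explicit HandleSketch(std::unique_ptr<EndpointSketch> ep) : ep_(std::move(ep)) {}
  HandleSketch(std::unique_ptr<EndpointSketch> ep, std::size_t nbytes)
    : ep_(std::move(ep)), nbytes_(nbytes) {}
  std::unique_ptr<EndpointSketch> ep_;
  std::optional<std::size_t> nbytes_;
};

// One factory for every endpoint type: only the endpoint construction differs.
template <typename EndpointT, typename... Args>
std::unique_ptr<HandleSketch> make_handle(std::optional<std::size_t> nbytes, Args&&... args)
{
  // In C++, unique_ptr<EndpointT> converts implicitly to unique_ptr<EndpointSketch>.
  std::unique_ptr<EndpointSketch> ep =
    std::make_unique<EndpointT>(std::forward<Args>(args)...);
  if (nbytes.has_value()) {
    return std::make_unique<HandleSketch>(std::move(ep), *nbytes);
  }
  return std::make_unique<HandleSketch>(std::move(ep));
}
```

With a helper of this shape, a wrapper would only need to call `make_handle<HttpEndpointSketch>(nbytes, url)` and move the result into its handle attribute.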

Member Author

Now using a C++ cast function, as @jakirkham suggested: 6104106
It is a bit more verbose than @vyasr's raw pointer approach, but it enforces the pointer uniqueness.
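The contents of the referenced commit are not shown in this thread. As a rough illustration of what a cast helper of this kind could look like, the sketch below uses hypothetical names (`BaseEndpoint`, `DerivedEndpoint`, `cast_to_base`); it shows ownership moving from the derived-type pointer to the base-type pointer without a raw pointer ever being exposed.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>
#include <utility>

// Hypothetical base and derived types, standing in for the endpoint classes.
struct BaseEndpoint {
  virtual ~BaseEndpoint() = default;
  virtual int kind() const { return 0; }
};

struct DerivedEndpoint : BaseEndpoint {
  int kind() const override { return 1; }
};

// Hypothetical cast helper: takes ownership of a unique_ptr<Derived> and
// returns it as a unique_ptr<Base>. The source pointer is left empty.
template <typename Base, typename Derived>
std::unique_ptr<Base> cast_to_base(std::unique_ptr<Derived> ptr)
{
  static_assert(std::is_base_of<Base, Derived>::value, "Derived must inherit from Base");
  return std::unique_ptr<Base>(std::move(ptr));
}
```

Because the argument is taken by value, a caller has to write `cast_to_base<BaseEndpoint>(std::move(ep))`, so there is never a moment where two owners of the same object exist.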

madsbk and others added 2 commits October 11, 2024 08:39
Co-authored-by: Vyas Ramasubramani <[email protected]>
Co-authored-by: Vyas Ramasubramani <[email protected]>
Contributor

@wence- wence- left a comment

Mostly looks good; I am bikeshedding on the names. However, I would like it to be harder for the user to use an unencrypted URL.

Some suggestions for cleanup in the Cython wrapper.

cpp/include/kvikio/remote_handle.hpp
Comment on lines 249 to 251
* @param url The full http url to the S3 file. NB: this should be an url starting with
* "http://" or "https://". If you have an S3 url of the form "s3://<bucket>/<object>",
* please use `S3Endpoint::parse_s3_url()` to convert it.
Contributor

Can we only use https please and reject http? Or do you want that for testing?

It looks like yes. I would like this interface to be "safe" by default, and so I would like the user to have to explicitly opt in to using an unencrypted link, given that we send secrets over the wire.

Also, how does parse_s3_url help directly? That returns a std::pair not a std::string. Should one use url_from_bucket_and_object on the result?

Should we error-check and raise if the URL doesn't start with https:// or http://?
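To make the two questions above concrete, here is a small C++ sketch of splitting an "s3://<bucket>/<object>" URL into a pair and of checking for an http(s) scheme. The helper names `split_s3_url` and `has_http_scheme` are hypothetical and this is not kvikio's `parse_s3_url` implementation; it only illustrates why the parser naturally returns a pair rather than a URL string, and what an explicit scheme check could look like.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical helper: split "s3://<bucket>/<object>" into (bucket, object).
// Throws std::invalid_argument if the URL does not have that shape.
std::pair<std::string, std::string> split_s3_url(const std::string& url)
{
  const std::string scheme = "s3://";
  if (url.compare(0, scheme.size(), scheme) != 0) {
    throw std::invalid_argument("expected a URL starting with s3://");
  }
  // The first '/' after the scheme separates the bucket from the object name.
  const std::size_t slash = url.find('/', scheme.size());
  if (slash == std::string::npos || slash == scheme.size() || slash + 1 == url.size()) {
    throw std::invalid_argument("expected s3://<bucket>/<object>");
  }
  return {url.substr(scheme.size(), slash - scheme.size()), url.substr(slash + 1)};
}

// Hypothetical check: true only for URLs that start with http:// or https://.
bool has_http_scheme(const std::string& url)
{
  return url.rfind("https://", 0) == 0 || url.rfind("http://", 0) == 0;
}
```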

Member Author

> Can we only use https please and reject http? Or do you want that for testing?

We also want http for high-performance access to public data.

> It looks like yes. I would like this interface to be "safe" by default, and so I would like the user to have to explicitly opt in to using an unencrypted link, given that we send secrets over the wire.

NB: only a time-specific signature is sent over the wire; curl uses aws_secret_access_key to generate the AWS Signature Version 4 authentication signature. Of course, the payload is sent unencrypted.

I think it is reasonable to use https by default and accept http only if the user explicitly overrides the endpoint URL?
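A minimal sketch of the "https by default, http only on explicit override" policy proposed here, under stated assumptions: the function name `make_s3_http_url`, its parameters, and the path-style layout used for the override are all hypothetical, and only the default virtual-hosted-style URL follows the documented AWS pattern. It is not the behavior that was merged.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical helper: build the HTTP(S) URL for an S3 object.
// Without an override the URL is always https. An unencrypted http URL can
// only result from the caller passing an explicit endpoint override.
std::string make_s3_http_url(const std::string& bucket,
                             const std::string& object,
                             const std::string& region,
                             const std::optional<std::string>& endpoint_override = std::nullopt)
{
  if (endpoint_override.has_value()) {
    const std::string& ep = *endpoint_override;
    const bool ok = ep.rfind("https://", 0) == 0 || ep.rfind("http://", 0) == 0;
    if (!ok) { throw std::invalid_argument("endpoint must start with http:// or https://"); }
    // Assumption: path-style addressing for custom endpoints.
    return ep + "/" + bucket + "/" + object;
  }
  // Safe default: virtual-hosted-style AWS URL over https.
  return "https://" + bucket + ".s3." + region + ".amazonaws.com/" + object;
}
```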


Contributor

@wence- wence- left a comment

Thanks Mads

@madsbk madsbk requested a review from vyasr October 21, 2024 15:17
@madsbk
Member Author

madsbk commented Oct 21, 2024

@vyasr do you have anything else?

Contributor

@vyasr vyasr left a comment

Some cosmetic suggestions but generally LGTM now.

cpp/include/kvikio/remote_handle.hpp
  {
    std::stringstream ss;
    ss << access_key << ":" << secret_access_key;
    _aws_userpwd = ss.str();
Contributor

It would be nice to have std::format in C++20...
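As a side note on this remark, the sketch below compares the stream-based construction from the snippet above with plain string concatenation, which needs no C++20 support and yields the same "access_key:secret_access_key" string. The function names are hypothetical; with a C++20 standard library the same result could be written as `std::format("{}:{}", access_key, secret_access_key)`.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Stream-based construction, mirroring the snippet under review.
std::string userpwd_stream(const std::string& access_key, const std::string& secret_access_key)
{
  std::stringstream ss;
  ss << access_key << ":" << secret_access_key;
  return ss.str();
}

// Equivalent construction with std::string concatenation (works before C++20).
std::string userpwd_concat(const std::string& access_key, const std::string& secret_access_key)
{
  return access_key + ":" + secret_access_key;
}
```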

@vyasr vyasr mentioned this pull request Oct 22, 2024
3 tasks
Co-authored-by: Vyas Ramasubramani <[email protected]>
@madsbk
Member Author

madsbk commented Oct 22, 2024

Thanks @vyasr

@madsbk
Member Author

madsbk commented Oct 22, 2024

/merge

@rapids-bot rapids-bot bot merged commit fcf4b15 into rapidsai:branch-24.12 Oct 22, 2024
57 checks passed
KyleFromNVIDIA added a commit to KyleFromNVIDIA/kvikio that referenced this pull request Oct 23, 2024
Since rapidsai#479, the libkvikio
wheel now includes platform-specific files. Stop tagging the wheel
as "any".
rapids-bot bot pushed a commit that referenced this pull request Oct 23, 2024
Since #479, the libkvikio wheel now includes platform-specific files. Stop tagging the wheel as "any".

Authors:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #507
Labels
improvement Improves an existing functionality non-breaking Introduces a non-breaking change
5 participants