-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remote IO: S3 support #479
Conversation
9287599
to
f82173e
Compare
a2308e2
to
ad92a7d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice work. I don't have any suggestions from a quick look. It'd be good to have a second pair of eyes, possibly from @wence-?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I didn't get around to reviewing #464, I took this opportunity to familiarize myself with that code and then reviewed here.
@@ -67,6 +76,59 @@ cdef class RemoteFile: | |||
ret._handle = make_unique[cpp_RemoteHandle](move(ep), n) | |||
return ret | |||
|
|||
@classmethod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a lot of logic repeated in these three functions and it would benefit from adding a single helper with most of the logic that you could just pass the endpoint created since the creation of ep
is really the only difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried, but I cannot find a way to implement a from_unique_ptr
factory function without it getting very complicated.
Basically, I want something like this:
@staticmethod
cdef RemoteFile from_unique_ptr(
unique_ptr[cpp_RemoteHandle] handle,
nbytes: Optional[int]
):
cdef RemoteFile ret = RemoteFile.__new__(RemoteFile)
if nbytes is None:
ret._handle = make_unique[cpp_RemoteHandle](move(handle))
return ret
ret._handle = make_unique[cpp_RemoteHandle](move(handle), <size_t> nbytes)
return ret
But I cannot find a nice way to call from_unique_ptr()
with a derived class instances like unique_ptr[cpp_HttpEndpoint]
. Any suggestions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps like this. I am not sure it is much cleaner:
def _set_handle(self, unique_ptr[cpp_RemoteEndpoint] ep, nbytes):
if nbytes is None:
self._handle = make_unique[cpp_RemoteHandle](move(ep))
else:
self._handle = make_unique[cpp_RemoteHandle](move(ep), <size_t>nbytes)
def open_http(...):
cdef RemoteFile ret = RemoteFile()
cdef unique_ptr[cpp_HttpEndpoint] ep = make_unique[cpp_HttpEndpoint](...)
ret._set_handle(<unique_ptr[cpp_RemoteEndpoint]>move(ep));
return ret;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fundamental problem is that Cython does not natively understand smart pointer polymorphism in the same way that it understands raw pointer polymorphism, so you cannot pass a unique_ptr to a child class where it expects you to pass a unique_ptr to a base class or vice versa. It's a clear sign of the C rather than C++ roots of the project. Here's a patch that builds for me locally and is a bit cleaner IMHO than what Lawrence proposed, with the caveat that you temporarily have a raw pointer that you construct a unique_ptr from rather than using make_unique. Up to you two if you like it or not.
diff --git a/python/kvikio/kvikio/_lib/remote_handle.pyx b/python/kvikio/kvikio/_lib/remote_handle.pyx
index 1156300..de48bb9 100644
--- a/python/kvikio/kvikio/_lib/remote_handle.pyx
+++ b/python/kvikio/kvikio/_lib/remote_handle.pyx
@@ -20,13 +20,11 @@ cdef extern from "<kvikio/remote_handle.hpp>" nogil:
cdef cppclass cpp_RemoteEndpoint "kvikio::RemoteEndpoint":
pass
- cdef cppclass cpp_HttpEndpoint "kvikio::HttpEndpoint":
+ cdef cppclass cpp_HttpEndpoint "kvikio::HttpEndpoint"(cpp_RemoteEndpoint):
cpp_HttpEndpoint(string url) except +
- cdef cppclass cpp_S3Endpoint "kvikio::S3Endpoint":
+ cdef cppclass cpp_S3Endpoint "kvikio::S3Endpoint"(cpp_RemoteEndpoint):
cpp_S3Endpoint(string url) except +
-
- cdef cppclass cpp_S3Endpoint "kvikio::S3Endpoint":
cpp_S3Endpoint(string bucket_name, string object_name) except +
pair[string, string] cpp_parse_s3_url \
@@ -56,6 +54,19 @@ cdef string _to_string(str_or_none):
return str.encode(str(str_or_none))
+cdef RemoteFile make_remotefile(
+ cpp_RemoteEndpoint* ep,
+ nbytes: Optional[int],
+):
+ cdef RemoteFile ret = RemoteFile()
+ if nbytes is None:
+ ret._handle = make_unique[cpp_RemoteHandle](unique_ptr[cpp_RemoteEndpoint](ep))
+ return ret
+ cdef size_t n = nbytes
+ ret._handle = make_unique[cpp_RemoteHandle](unique_ptr[cpp_RemoteEndpoint](ep), n)
+ return ret
+
+
cdef class RemoteFile:
cdef unique_ptr[cpp_RemoteHandle] _handle
@@ -65,16 +76,10 @@ cdef class RemoteFile:
url: str,
nbytes: Optional[int],
):
- cdef RemoteFile ret = RemoteFile()
- cdef unique_ptr[cpp_HttpEndpoint] ep = make_unique[cpp_HttpEndpoint](
- _to_string(url)
+ return make_remotefile(
+ new cpp_HttpEndpoint(_to_string(url)),
+ nbytes,
)
- if nbytes is None:
- ret._handle = make_unique[cpp_RemoteHandle](move(ep))
- return ret
- cdef size_t n = nbytes
- ret._handle = make_unique[cpp_RemoteHandle](move(ep), n)
- return ret
@classmethod
def open_s3(
@@ -83,16 +88,10 @@ cdef class RemoteFile:
object_name: str,
nbytes: Optional[int],
):
- cdef RemoteFile ret = RemoteFile()
- cdef unique_ptr[cpp_S3Endpoint] ep = make_unique[cpp_S3Endpoint](
- _to_string(bucket_name), _to_string(object_name)
+ return make_remotefile(
+ new cpp_S3Endpoint(_to_string(bucket_name), _to_string(object_name)),
+ nbytes,
)
- if nbytes is None:
- ret._handle = make_unique[cpp_RemoteHandle](move(ep))
- return ret
- cdef size_t n = nbytes
- ret._handle = make_unique[cpp_RemoteHandle](move(ep), n)
- return ret
@classmethod
def open_s3_from_http_url(
@@ -100,16 +99,10 @@ cdef class RemoteFile:
url: str,
nbytes: Optional[int],
):
- cdef RemoteFile ret = RemoteFile()
- cdef unique_ptr[cpp_S3Endpoint] ep = make_unique[cpp_S3Endpoint](
- _to_string(url)
+ return make_remotefile(
+ new cpp_S3Endpoint(_to_string(url)),
+ nbytes,
)
- if nbytes is None:
- ret._handle = make_unique[cpp_RemoteHandle](move(ep))
- return ret
- cdef size_t n = nbytes
- ret._handle = make_unique[cpp_RemoteHandle](move(ep), n)
- return ret
@classmethod
def open_s3_from_s3_url(
@@ -118,16 +111,10 @@ cdef class RemoteFile:
nbytes: Optional[int],
):
cdef pair[string, string] bucket_and_object = cpp_parse_s3_url(_to_string(url))
- cdef RemoteFile ret = RemoteFile()
- cdef unique_ptr[cpp_S3Endpoint] ep = make_unique[cpp_S3Endpoint](
- bucket_and_object.first, bucket_and_object.second
+ return make_remotefile(
+ new cpp_S3Endpoint(bucket_and_object.first, bucket_and_object.second),
+ nbytes,
)
- if nbytes is None:
- ret._handle = make_unique[cpp_RemoteHandle](move(ep))
- return ret
- cdef size_t n = nbytes
- ret._handle = make_unique[cpp_RemoteHandle](move(ep), n)
- return ret
def nbytes(self) -> int:
return deref(self._handle).nbytes()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this just be a templated C++ factory function that we extern
to Cython and call here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could do part of it in C++ but RemoteFile is a Python class so you'd still need a wrapper around the C++ function to handle instantiation and assignment to attributes of that object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think that is fine. Basically what we would be saying is here is a factory function that takes some arguments and returns a std::unique_ptr<cpp_RemoteHandle>
, which we just move
to ret._handle
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup that would be fine. Probably just use inline C++ to define that in this file itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a C++ cast function as @jakirkham suggest: 6104106
It is a bit more verbose than @vyasr's raw pointer approach, but it enforces the pointer uniqueness.
Co-authored-by: Vyas Ramasubramani <[email protected]>
Co-authored-by: Vyas Ramasubramani <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looks good, and I am bikeshedding on the names. Though I would like it to be harder for the user to use an unencrypted URL.
Some suggestions for cleanup in the cython wrapper.
cpp/include/kvikio/remote_handle.hpp
Outdated
* @param url The full http url to the S3 file. NB: this should be an url starting with | ||
* "http://" or "https://". If you have an S3 url of the form "s3://<bucket>/<object>", | ||
* please use `S3Endpoint::parse_s3_url()` to convert it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we only use https
please and reject http
? Or do you want that for testing?
It looks like yes. I would like this interface to be "safe" by default, and so I would like the user to have to explicitly opt in to using an unencrypted link, given that we send secrets over the wire.
Also, how does parse_s3_url
help directly? That returns a std::pair
not a std::string
. Should one use url_from_bucket_and_object
on the result?
Should we error-check and raise if the URL doesn't start with https://
or http://
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we only use https please and reject http? Or do you want that for testing?
We also wants http for high performance access to public data.
It looks like yes. I would like this interface to be "safe" by default, and so I would like the user to have to explicitly opt in to using an unencrypted link, given that we send secrets over the wire.
NB: only a time specific signature are send over the wire, curl uses aws_secret_access_key
to generate the AWS authentication signature V4. Of cause, the payload is send unencrypted.
I think it is reasonable to use https by default and accept http if the user overwrite the endpoint url explicitly?
@@ -67,6 +76,59 @@ cdef class RemoteFile: | |||
ret._handle = make_unique[cpp_RemoteHandle](move(ep), n) | |||
return ret | |||
|
|||
@classmethod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps like this. I am not sure it is much cleaner:
def _set_handle(self, unique_ptr[cpp_RemoteEndpoint] ep, nbytes):
if nbytes is None:
self._handle = make_unique[cpp_RemoteHandle](move(ep))
else:
self._handle = make_unique[cpp_RemoteHandle](move(ep), <size_t>nbytes)
def open_http(...):
cdef RemoteFile ret = RemoteFile()
cdef unique_ptr[cpp_HttpEndpoint] ep = make_unique[cpp_HttpEndpoint](...)
ret._set_handle(<unique_ptr[cpp_RemoteEndpoint]>move(ep));
return ret;
…o remote-io-s3
Co-authored-by: Lawrence Mitchell <[email protected]> Co-authored-by: Vyas Ramasubramani <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Mads
@vyasr do you have anything else ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some cosmetic suggestions but generally LGTM now.
{ | ||
std::stringstream ss; | ||
ss << access_key << ":" << secret_access_key; | ||
_aws_userpwd = ss.str(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to have std::format in C++20...
Co-authored-by: Vyas Ramasubramani <[email protected]>
Thanks @vyasr |
/merge |
Since rapidsai#479, the libkvikio wheel now includes platform-specific files. Stop tagging the wheel as "any".
Since #479, the libkvikio wheel now includes platform-specific files. Stop tagging the wheel as "any". Authors: - Kyle Edwards (https://github.com/KyleFromNVIDIA) Approvers: - Bradley Dice (https://github.com/bdice) URL: #507
Implements AWS S3 read support using libcurl:
Supersedes #426