DRILL-7745: Add storage plugin for IPFS #2084
base: master
@dbw9580 |
Yes, please. Some major problems I'm working on are:
|
contrib/storage-ipfs/src/main/java/org/apache/drill/exec/store/ipfs/IPFSContext.java
Outdated
contrib/storage-ipfs/src/main/java/org/apache/drill/exec/store/ipfs/IPFSHelper.java
Outdated
contrib/storage-ipfs/src/main/java/org/apache/drill/exec/store/ipfs/IPFSHelper.java
Outdated
contrib/storage-ipfs/src/main/java/org/apache/drill/exec/store/ipfs/IPFSHelper.java
Outdated
contrib/storage-ipfs/src/main/java/org/apache/drill/exec/store/ipfs/IPFSHelper.java
Outdated
contrib/storage-ipfs/src/main/java/org/apache/drill/exec/store/ipfs/IPFSHelper.java
Outdated
contrib/storage-ipfs/src/main/java/org/apache/drill/exec/store/ipfs/IPFSPeer.java
contrib/storage-ipfs/src/main/java/org/apache/drill/exec/store/ipfs/IPFSPeer.java
...rib/storage-ipfs/src/main/java/org/apache/drill/exec/store/ipfs/IPFSStoragePluginConfig.java
Outdated
...rib/storage-ipfs/src/main/java/org/apache/drill/exec/store/ipfs/IPFSStoragePluginConfig.java
Outdated
Added small refactoring ideas. Did not check implementation details |
@dbw9580 You'll have to do that here: Lines 300 to 304 in 7d5b611
and here: drill/distribution/src/assemble/component.xml Lines 28 to 56 in 7d5b611
|
@dbw9580 |
Yes, I'll try my best. I was stuck with CSV and writer support, and haven't made much progress so far. Can we settle for basic JSON and reader support for now, and maybe add those later?
@dbw9580 I've never done a writer, but I can assist with the CSV reader. Take a look here: The HTTP storage plugin can read either |
break out CSV and the writer support as separate PRs.
Great. Will do.
I attempted to use Drill's built-in CSV reader, but I would have had to do a lot of work on the CSV reader to get it to work... So... just used this simple version.
Ok. The text reader module from the easy format plugin framework looks good, but I couldn't figure out a way to reuse that part of code in this plugin. Copy-pasting code is not accepted, I assume?
Cutting/pasting really wouldn't be the best approach here although there unfortunately are many examples of it in Drill. :-/
The issue I ran into with the HTTP plugin was that I had an `InputStream` and needed the text reader to accept an `InputStream` rather than a file. I attempted to modify the `CompliantTextReader` to accept an `InputStream`, but it was really complicated and I eventually gave up on that for the time being.
Your plugin looks similar in this regard. It looks like you're getting an `InputStream` from IPFS. In theory you can then send that to any Drill format reader that accepts `InputStream`s.
Bottom line, if you can refactor the `CompliantTextReader` to accept an `InputStream`, then it becomes trivial to use this class for your plugin. If you can't, I'd suggest "borrowing" code from the HTTP CSV plugin.
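To make the `InputStream` point concrete, here is a minimal sketch of a line-based CSV reader that consumes any `InputStream`. It is not Drill's `CompliantTextReader` and not the HTTP plugin's reader: the class `StreamCsvSketch` and the method `readRows` are hypothetical names, and the parsing deliberately ignores quoting and escaping.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Drill: reads simple CSV rows from any InputStream.
class StreamCsvSketch {

  // Splits each non-empty line on commas. No support for quoted or escaped fields.
  static List<List<String>> readRows(InputStream in) {
    List<List<String>> rows = new ArrayList<>();
    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        if (line.isEmpty()) {
          continue; // skip blank lines
        }
        // A limit of -1 keeps trailing empty fields, e.g. "a,b," yields three fields.
        rows.add(Arrays.asList(line.split(",", -1)));
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return rows;
  }
}
```

Because the method only depends on `InputStream`, the same code path works whether the bytes come from IPFS, HTTP, or a local file.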
contrib/storage-ipfs/src/main/java/org/apache/drill/exec/store/ipfs/IPFSJSONRecordReader.java
Outdated
contrib/storage-ipfs/src/main/java/org/apache/drill/exec/store/ipfs/IPFSScanSpec.java
Outdated
contrib/storage-ipfs/src/main/java/org/apache/drill/exec/store/ipfs/IPFSScanSpec.java
contrib/storage-ipfs/README.md
Outdated
Start the service:
```
systemctl start drill-embedded.service
```
Please create a separate JIRA to add documentation to the Drill website. That does not have to be part of this PR, but otherwise nobody will know about this. ;-).
contrib/storage-ipfs/src/main/java/org/apache/drill/exec/store/ipfs/IPFSSubScan.java
Outdated
...rib/storage-ipfs/src/main/java/org/apache/drill/exec/store/ipfs/IPFSStoragePluginConfig.java
Outdated
...rib/storage-ipfs/src/main/java/org/apache/drill/exec/store/ipfs/IPFSStoragePluginConfig.java
Outdated
contrib/storage-ipfs/src/main/java/org/apache/drill/exec/store/ipfs/IPFSStoragePlugin.java
Outdated
contrib/storage-ipfs/src/main/java/org/apache/drill/exec/store/ipfs/IPFSStoragePlugin.java
Thanks for submitting and these revisions look good.
@dbw9580 That also might simplify the code a lot. |
contrib/storage-ipfs/README.zh.md
Outdated
@@ -0,0 +1,184 @@
# Drill Storage Plugin for IPFS
Please keep only `README.md` in English, since it would be problematic to update it for other developers.
@vvysotskyi
Can they put a link to the Chinese version somewhere?
But for this case, documentation outside the project also may be outdated.
contrib/storage-ipfs/src/main/java/org/apache/drill/exec/store/ipfs/IPFSContext.java
Outdated
common/src/test/java/org/apache/drill/categories/IPFSStorageTest.java
Outdated
@dbw9580
This is definitely making progress. Will testing require an IPFS installation?
contrib/storage-ipfs/README.md
Outdated
A live demo: <http://www.datahub.pub/> hosted on a private cluster of Minerva.

Note that it's still in early stages of development and the overall stability and performance is not satisfactory. PRs are very much welcome!
When we're ready to commit, can you please update the docs and remove all the language about building Drill, etc.?
contrib/storage-ipfs/src/main/java/org/apache/drill/exec/store/ipfs/IPFSCompat.java
Outdated
contrib/storage-ipfs/src/main/java/org/apache/drill/exec/store/ipfs/IPFSCompat.java
contrib/storage-ipfs/src/main/java/org/apache/drill/exec/store/ipfs/IPFSContext.java
Outdated
contrib/storage-ipfs/src/main/java/org/apache/drill/exec/store/ipfs/IPFSGroupScan.java
Outdated
contrib/storage-ipfs/src/main/java/org/apache/drill/exec/store/ipfs/IPFSGroupScan.java
Outdated
contrib/storage-ipfs/src/main/java/org/apache/drill/exec/store/ipfs/IPFSGroupScan.java
contrib/storage-ipfs/src/main/java/org/apache/drill/exec/store/ipfs/IPFSScanBatchCreator.java
Outdated
contrib/storage-ipfs/src/main/java/org/apache/drill/exec/store/ipfs/IPFSSubScan.java
contrib/storage-ipfs/src/test/java/org/apache/drill/exec/store/ipfs/IPFSTestBase.java
Yes, a running IPFS daemon is required. |
Can you write some tests that do not require the IPFS daemon? I realize that we'll need it for some unit tests, but is it possible to test various components w/o the daemon? |
I guess the storage plugin config, scanSpec, etc. are static and independent of a running IPFS daemon. Also, for the JSON reader, I think I can supply test data from memory directly, without actually retrieving it from IPFS.
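As a rough illustration of the "test data from memory" idea, the sketch below hides the data source behind a `Supplier<InputStream>`, so a test can pass in-memory bytes where production code would open a stream from IPFS. The names `InMemorySourceSketch`, `countRecords` and `fromString` are hypothetical and not part of this PR, and counting non-blank lines merely stands in for real JSON parsing.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import java.util.function.Supplier;

// Hypothetical test helper: the reader logic never learns where the bytes come from.
class InMemorySourceSketch {

  // Counts non-blank lines (one JSON record per line is assumed) from the supplied stream.
  static int countRecords(Supplier<InputStream> source) {
    int count = 0;
    try (Scanner scanner = new Scanner(source.get(), "UTF-8")) {
      while (scanner.hasNextLine()) {
        if (!scanner.nextLine().trim().isEmpty()) {
          count++;
        }
      }
    }
    return count;
  }

  // In a unit test, the "IPFS object" is just a string held in memory.
  static Supplier<InputStream> fromString(String data) {
    return () -> new ByteArrayInputStream(data.getBytes(StandardCharsets.UTF_8));
  }
}
```

A daemon-free test would then call something like `countRecords(fromString("{\"a\":1}\n{\"a\":2}\n"))` and assert on the result.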
If I leave an instance of Drill running and then run the unit test ( |
@dbw9580 I think the reason this isn't doing what you're expecting is that in the I stepped through |
@cgivre Does Drill support connections from IPv6 sockets? Is it enabled by default or do I have to toggle some configuration items? The "protocol family unavailable" error could be due to lack of support for IPv6. |
The I believe the reason Drill didn't bind to the default ports is that those ports were used by the process from the last test run and had not been recycled by the system. If I wait for a minute or two before starting another round of testing, it's likely the test will pass. This is part of DRILL-7754, but I haven't come up with a plan to reliably store the ports info in IPFS. |
I don't see Drill binding to any IPv6 address in |
@dbw9580 |
Found that if I ran tests with |
contrib/storage-ipfs/src/main/java/org/apache/drill/exec/store/ipfs/IPFSSubScan.java
Outdated
@dbw9580 |
@cgivre I think it's based on the current master already. |
@dbw9580
Thanks! |
@cgivre sure, but I have to do these tomorrow (it's now midnight in my timezone). And maybe allow some time for the IPFS API repo to release an official version: ipfs-shipyard/java-ipfs-http-client#172 (comment) ? |
.setState(DrillbitEndpoint.State.ONLINE)
.build();
//DRILL-7777: how to safely remove endpoints that are no longer needed once the query is completed?
ClusterCoordinator.RegistrationHandle handle = coordinator.register(ep);
@dbw9580, could you please explain why it is required to register a Drillbit endpoint? It is prohibited to do this anywhere except where the Drillbit starts up. Once the endpoint is registered, it may be misused when executing other queries. Also, the same node may run several group scans, so it will fail in this case because the required ports will already be in use.
I need to tell Drill that the IPFS peers that are also running Drill can be used to execute queries in a distributed fashion, so these Drillbit endpoints are created on the fly. Maybe limit these dynamically created endpoints to be used by this plugin only?
Drill should obtain all the info about the source of the data only from the query plan and minor fragments. Please take a look at existing storage plugins, or even file format plugins to see how it is implemented there.
The problem is that unlike the other types of storage plugins, where the Drillbits are known at the time Drill starts, because they reside in the same cluster and are managed by a coordinator, the Drillbits in this plugin are remote IPFS peers which are associated with a particular query, thus can only be known when the user runs a query.
The complete workflow of this plugin is:
- the user inputs an SQL statement that specifies an IPFS path to the target table of the query;
- this plugin resolves the path to an IPFS object and finds its "providers", i.e. IPFS nodes which store the target object, and filters out those which are running Drill (Drill-ready);
- these nodes are registered as Drillbit endpoints, and the query plan is sent to them;
- these nodes execute the query plan and return results.
I made some slides to illustrate the basic idea: https://www.slideshare.net/BowenDing4/minerva-ipfs-storage-plugin-for-ipfs, starting on slide 10.
I understand that the way this storage plugin works may break Drill's existing model, but I couldn't find a plugin that works in a similar way, and the internal workings of Drill are too complex to go through. Could you please be more specific about how this plugin can be incompatible with other queries?
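The provider-filtering step of the workflow listed above can be sketched roughly as follows. This is not the plugin's actual code: `Provider`, `runsDrill` and `selectDrillReady` are hypothetical names, and the "Drill-ready" check is passed in as a predicate because the source does not say how that check is performed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class ProviderSelectionSketch {

  // Hypothetical stand-in for an IPFS node that stores the target object.
  static final class Provider {
    final String peerId;
    final boolean runsDrill;

    Provider(String peerId, boolean runsDrill) {
      this.peerId = peerId;
      this.runsDrill = runsDrill;
    }
  }

  // Keeps only the providers that pass the Drill-ready check, preserving their order.
  static List<Provider> selectDrillReady(List<Provider> providers,
                                         Predicate<Provider> isDrillReady) {
    List<Provider> ready = new ArrayList<>();
    for (Provider provider : providers) {
      if (isDrillReady.test(provider)) {
        ready.add(provider);
      }
    }
    return ready;
  }
}
```

The selected providers are the nodes that would then be registered as Drillbit endpoints and receive the query plan.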
@dbw9580,
How would this plugin work if you were joining data from IPFS with data from another storage plugin? Would that break anything?
I'm wondering whether there is some way to mark an endpoint as being for IPFS only, or even for a particular query only, so that it could not be misused, which would answer @vvysotskyi's concerns.
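One way to picture "for a particular query only" is a registry keyed by query ID whose registrations are removed when a handle is closed. This is purely a hypothetical sketch: `ScopedEndpointRegistrySketch`, `Handle`, `register` and `endpointsFor` are invented names, endpoints are plain strings, and nothing here is Drill's `ClusterCoordinator` API.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical registry: endpoints are visible only to the query that registered them.
class ScopedEndpointRegistrySketch {

  // Narrowed AutoCloseable so that close() declares no checked exception.
  interface Handle extends AutoCloseable {
    @Override
    void close();
  }

  private final Map<String, Set<String>> endpointsByQuery = new HashMap<>();

  // Registers endpoints for one query; closing the returned handle removes them again.
  synchronized Handle register(String queryId, Set<String> endpoints) {
    endpointsByQuery.put(queryId, new HashSet<>(endpoints));
    return () -> unregister(queryId);
  }

  private synchronized void unregister(String queryId) {
    endpointsByQuery.remove(queryId);
  }

  // Other queries never see endpoints registered under a different query ID.
  synchronized Set<String> endpointsFor(String queryId) {
    Set<String> endpoints = endpointsByQuery.get(queryId);
    if (endpoints == null) {
      return Collections.emptySet();
    }
    return Collections.unmodifiableSet(new HashSet<>(endpoints));
  }
}
```

Used with try-with-resources, the handle also gives a natural place to drop endpoints once the query completes, which is the open question noted as DRILL-7777 in the diff.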
@dbw9580,
Perhaps try some variations like an aggregate query, or a query that combines multiple data sources AND has subqueries. See if you can break it. ;-)
My hunch is that we can find a way to resolve @vvysotskyi's concerns. This is a new (and interesting) use case, and we should find a way to include it.
I don't know how to produce a test case that will cause it to fail. I tested successfully in a two-node cluster with queries that involve data from the HTTP storage plugin, the classpath plugin and this plugin, and that combine join, filter and sort operators with nested subqueries. If @vvysotskyi could provide a test case that shows these dynamically added endpoints can be a problem, I can look into that and see what solutions we can find.
@dbw9580, the case I mentioned may be reproduced in a cluster with several nodes when the endpoint for this plugin is registered for a node with no running Drillbit that belongs to the current Drill cluster. In this case, the endpoint for that node will be registered. Assume that another query is submitted at this time. Drill will try to send the plan fragment to this node, and this will cause problems since no Drillbit is actually running there.
Please take a look at the logic in `BlockMapBuilder`, where it decides which Drillbit will execute a specific `CompleteWork` instance.
I think the issue with unit tests you observed previously was a case similar to this.
@vvysotskyi @cgivre I think refactoring the cluster coordinator and the planner is beyond me. I actually tried to create an endpoint registry that manages endpoints according to the plugin they are registered with, but that didn't work out. The attempt is here: bdchain@85e7d2a
In 0c34b56 I added a manual switch in the config to allow the user to choose whether to run queries in distributed mode. If it is set to false, then no endpoints will be registered on the fly, but the plan will be executed by the foreman, just like the way the HTTP storage plugin works. I hope by doing so (and clarifying in the doc) we can mitigate the problem a bit.
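The effect of such a switch can be sketched as a simple choice of where the plan runs. The names below (`ExecutionModeSketch`, `chooseExecutors`, `distributedMode`) are hypothetical and are not the option or code added in 0c34b56; falling back to the foreman when no Drill-ready peer is found is an extra assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class ExecutionModeSketch {

  // Returns the nodes that should execute the plan.
  // With distributed mode off, no remote endpoints are used and only the foreman runs it.
  static List<String> chooseExecutors(boolean distributedMode,
                                      String foreman,
                                      List<String> drillReadyPeers) {
    // Assumption of this sketch: an empty peer list also falls back to the foreman.
    if (!distributedMode || drillReadyPeers.isEmpty()) {
      return Collections.singletonList(foreman);
    }
    return new ArrayList<>(drillReadyPeers);
  }
}
```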
@dbw9580 Alternatively, what would happen if user B executes a query while user A's IPFS queries are running? What would happen if user A's query completes before user B's? Would it tear down the Drillbits and cause a crash? I'm asking because I really don't know here.
The test failure looks irrelevant. |
@dbw9580 |
Yes. I'm currently busy with other projects, and haven't had time to look further into this. I remember we were having some discussions about the way this plugin interacts with the Drill coordinator that needs a major design reconsideration. When I can spare more time on this, I will continue where I left off. |
@dbw9580 I was looking at this again after someone tagged me in an issue, but I think there is a WAY easier way to get this done. Take a look at this PR that I submitted to add the Dropbox file system to Drill: (#2337). It re-uses a lot of Drill's internals so that you don't have to deal with all the format readers. |
Add storage plugin for IPFS. See detailed introduction here.
TODOs:
Authors:
This storage plugin was contributed by Bowen Ding, Yuedong Xu and Liang Wang. The authors are affiliated with Fudan University.