Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRILL-7745: Add storage plugin for IPFS #2084

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

dbw9580
Copy link

@dbw9580 dbw9580 commented May 31, 2020

Add storage plugin for IPFS. See detailed introduction here.

TODOs:

  • Port to Drill 1.18.0
  • Add more tests
  • Support JSON reader

Authors:

This storage plugin was contributed by Bowen Ding, Yuedong Xu and Liang Wang. The authors are affiliated with Fudan University.

@cgivre
Copy link
Contributor

cgivre commented May 31, 2020

@dbw9580
Thanks for contributing this. Do you want review comments now?

@dbw9580
Copy link
Author

dbw9580 commented May 31, 2020

Yes, please.

Some major problems I'm working on are:

  • support for different data formats. Currently only JSON files are supported, and CSV support is removed due to changes introduced in v1.17 and v1.18. The original implementation was a copy-paste of the easy format plugin (org/apache/drill/exec/store/easy/text/TextFormatPlugin.java). I wonder if there is a better way to reuse the code there.
  • support for CREATE TABLE statements. This will require changes to the Drill framework responsible for SQL parsing, query plan generation, etc. I would appreciate opinions from you.

@sanel
Copy link

sanel commented May 31, 2020

Added small refactoring ideas. Did not check implementation details

@cgivre
Copy link
Contributor

cgivre commented Jun 2, 2020

@dbw9580
One more comment. You'll want to add your plugin to the distribution files so that it will be built when Drill is built.

You'll have to do that here:

drill/distribution/pom.xml

Lines 300 to 304 in 7d5b611

<groupId>org.apache.drill.contrib</groupId>
<artifactId>drill-opentsdb-storage</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>

and here:

<include>org.apache.drill.exec:drill-jdbc:jar</include>
<include>org.apache.drill:drill-protocol:jar</include>
<include>org.apache.drill:drill-common:jar</include>
<include>org.apache.drill:drill-logical:jar</include>
<include>org.apache.drill.exec:vector:jar</include>
<include>org.apache.drill.memory:drill-memory-base:jar</include>
<include>org.apache.drill.exec:drill-rpc:jar</include>
<include>org.apache.drill.exec:drill-java-exec:jar</include>
<include>org.apache.drill.metastore:drill-metastore-api:jar</include>
<include>org.apache.drill.metastore:drill-iceberg-metastore:jar</include>
<include>org.apache.drill.metastore:drill-rdbms-metastore:jar</include>
<include>org.apache.drill.contrib.storage-hive:drill-storage-hive-core:jar</include>
<include>org.apache.drill.contrib.storage-hive:drill-hive-exec-shaded:jar</include>
<include>org.apache.drill.contrib.data:tpch-sample-data:jar</include>
<include>org.apache.drill.contrib:drill-mongo-storage:jar</include>
<include>org.apache.drill.contrib:drill-storage-hbase:jar</include>
<include>org.apache.drill.contrib:drill-format-mapr:jar</include>
<include>org.apache.drill.contrib:drill-format-syslog:jar</include>
<include>org.apache.drill.contrib:drill-format-esri:jar</include>
<include>org.apache.drill.contrib:drill-format-hdf5:jar</include>
<include>org.apache.drill.contrib:drill-format-ltsv:jar</include>
<include>org.apache.drill.contrib:drill-format-excel:jar</include>
<include>org.apache.drill.contrib:drill-format-spss:jar</include>
<include>org.apache.drill.contrib:drill-jdbc-storage:jar</include>
<include>org.apache.drill.contrib:drill-kudu-storage:jar</include>
<include>org.apache.drill.contrib:drill-storage-kafka:jar</include>
<include>org.apache.drill.contrib:drill-storage-http:jar</include>
<include>org.apache.drill.contrib:drill-opentsdb-storage:jar</include>
<include>org.apache.drill.contrib:drill-udfs:jar</include>

@cgivre cgivre self-assigned this Jun 2, 2020
@cgivre
Copy link
Contributor

cgivre commented Jun 18, 2020

@dbw9580
We're looking to get a release of Drill out by the end of June. If you can get these revisions done soon, I can help expedite the review so that we can get this (or at least an MVP of this) into Drill 1.18.

@dbw9580
Copy link
Author

dbw9580 commented Jun 18, 2020

Yes I'll try my best. I was stuck with CSV and writer support, and haven't had much progress so far. Can we settle with basic JSON and reader support for now, and maybe add those later?

@cgivre
Copy link
Contributor

cgivre commented Jun 18, 2020

@dbw9580
Why don't you break out CSV and the writer support as separate PRs. Then we can get this in, and work on those for the next release.

I've never done a writer, but I can assist with the CSV reader. Take a look here:
https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/HttpCSVBatchReader.java

The HTTP storage plugin can read either JSON or CSV. I attempted to use Drill's built in CSV reader but I would have had to do a lot of work on the CSV reader to get it to work... So... just used this simple version.

@dbw9580
Copy link
Author

dbw9580 commented Jun 18, 2020

break out CSV and the writer support as separate PRs.

Great. Will do.

I attempted to use Drill's built in CSV reader but I would have had to do a lot of work on the CSV reader to get it to work... So... just used this simple version.

Ok. The text reader module from the easy format plugin framework looks good, but I couldn't figure out a way to reuse that part of code in this plugin. Copy-pasting code is not accepted, I assume?

Copy link
Contributor

@cgivre cgivre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

break out CSV and the writer support as separate PRs.

Great. Will do.

I attempted to use Drill's built in CSV reader but I would have had to do a lot of work on the CSV reader to get it to work... So... just used this simple version.

Ok. The text reader module from the easy format plugin framework looks good, but I couldn't figure out a way to reuse that part of code in this plugin. Copy-pasting code is not accepted, I assume?

Cutting/pasting really wouldn't be the best approach here although there unfortunately are many examples of it in Drill. :-/

The issue I ran into with the HTTP plugin was that I had an InputStream and needed the text reader to accept an InputStream rather than a file. I attempted to modify the ComplaintTextReader to accept an InputStream but it was really complicated and eventually gave up on that for the time being.

Your plugin looks similar in this regard. It looks like you're getting an InputStream from IPFS. In theory you can then send that to any Drill format reader that accepts InputStreams.

Bottom line, if you can refactor the CompliantTextReader to accept an InputStream then it becomes trivial to use this class for your plugin. If you can't, I'd suggest "borrowing" code from the HTTP CSV plugin.

Start the service:
```
systemd start drill-embedded.service
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please create a separate JIRA to add documentation to the Drill website. That does not have to be part of this PR, but otherwise nobody will know about this. ;-).

contrib/storage-ipfs/pom.xml Outdated Show resolved Hide resolved
Copy link
Contributor

@cgivre cgivre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for submitting and these revisions look good.

@cgivre
Copy link
Contributor

cgivre commented Jun 19, 2020

@dbw9580
I had another thought about the design. I wonder if instead of "rolling your own" readers, if you could do what the file plugin does and make use of all the existing format plugins. Basically, that way any new file format for which there is a format plugin could also be read by IPFS. I don't know IPFS that well to know if that would even make sense, or not.

That also might simplify the code a lot.

@@ -0,0 +1,184 @@
# Drill Storage Plugin for IPFS
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please keep only README.md in English, since it would be problematic to update it for other developers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vvysotskyi
Can they put a link to the Chinese version somewhere?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But for this case, documentation outside the project also may be outdated.

contrib/storage-ipfs/pom.xml Outdated Show resolved Hide resolved
@dbw9580 dbw9580 marked this pull request as ready for review June 22, 2020 18:48
@cgivre cgivre changed the title [WIP] DRILL-7745: Add storage plugin for IPFS DRILL-7745: Add storage plugin for IPFS Jun 22, 2020
Copy link
Contributor

@cgivre cgivre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dbw9580
This is definitely making progress. Will testing require an IPFS installation?


A live demo: <http://www.datahub.pub/> hosted on a private cluster of Minerva.

Note that it's still in early stages of development and the overall stability and performance is not satisfactory. PRs are very much welcome!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we're ready to commit, can you please update the docs and remove all the language about building Drill etc.

@dbw9580
Copy link
Author

dbw9580 commented Jun 23, 2020

@dbw9580
This is definitely making progress. Will testing require an IPFS installation?

Yes, a running IPFS daemon is required.

@cgivre
Copy link
Contributor

cgivre commented Jun 23, 2020

@dbw9580
This is definitely making progress. Will testing require an IPFS installation?

Yes, a running IPFS daemon is required.

Can you write some tests that do not require the IPFS daemon? I realize that we'll need it for some unit tests, but is it possible to test various components w/o the daemon?

@dbw9580
Copy link
Author

dbw9580 commented Jun 23, 2020

some tests that do not require the IPFS daemon

I guess for the storage plugin config, scanSpec, etc, they are static and independent on a running IPFS daemon. Also for the json reader, I think I can supply test data from memory directly, w/o actually retrieving it from IPFS.

@dbw9580
Copy link
Author

dbw9580 commented Aug 13, 2020

If I leave an instance of Drill running and then run the unit test (TestIPFSQueries), then it passes. I think the unit test does not actually build and run a full Drill server, which is why the connections are rejected.

@cgivre
Copy link
Contributor

cgivre commented Aug 13, 2020

@dbw9580
The ClusterTest class is supposed to start a Drill cluster so that you can execute queries. You should not need to have a Drill cluster running for the unit tests to complete.

I think the reason this isn't doing what you're expecting is that in the initIPFS function in IPFSTestSuit you are creating a plugin with a null configuration and hence isn't initializing correctly.

I stepped through testSimple() with the debugger and the dataset object is null, hence the test fails. My suspicion is that there is one small step being missed here. Could you please take a look and step through this to make sure that Drill is being initialized correctly?
Thanks

@dbw9580
Copy link
Author

dbw9580 commented Aug 14, 2020

@cgivre Does Drill support connections from IPv6 sockets? Is it enabled by default or do I have to toggle some configuration items? The "protocol family unavailable" error could be due to lack of support for IPv6.

@cgivre
Copy link
Contributor

cgivre commented Aug 14, 2020

@dbw9580 I believe Drill does support connections from IPv6 sockets. There was a recent PR for this in fact: (#1857) but I'm not sure if that is directly relevant.
Were you able to get it working?

@dbw9580
Copy link
Author

dbw9580 commented Aug 14, 2020

The connection rejected: /127.0.0.1:31011 failure was because sometimes Drill does not bind to the default ports (31010, 31011, 31012). It can bind to later ports like 31013, 31014, 31015, hence the connection was rejected.

I believe the reason Drill didn't bind to the default ports is that those ports was used by the process from the last test run and had not been recycled by the system. If I wait for a minute or two before starting another round of testing, it's likely the test will pass.

This is part of DRILL-7754, but I haven't come up with a plan to reliably store the ports info in IPFS.

@dbw9580
Copy link
Author

dbw9580 commented Aug 14, 2020

@dbw9580 I believe Drill does support connections from IPv6 sockets. There was a recent PR for this in fact: (#1857) but I'm not sure if that is directly relevant.
Were you able to get it working?

I don't see Drill binding to any IPv6 address in ss -6ltnp. I blocked IPv6 addresses in 9494a30 and the tests are now passing (most of the time, due to #2084 (comment)).

@cgivre
Copy link
Contributor

cgivre commented Aug 14, 2020

@dbw9580
The unit tests are passing now on my machine.

@dbw9580
Copy link
Author

dbw9580 commented Aug 14, 2020

@cgivre
I tried to set the ports to their default values in c090a43, but it did not seem to do the trick. Why is that?

@dbw9580
Copy link
Author

dbw9580 commented Aug 16, 2020

Found that if I ran tests with mvn clean test -DforkMode=never, then the port already in use errors were gone. Have no idea why.

@cgivre
Copy link
Contributor

cgivre commented Aug 16, 2020

Hi @dbw9580
Thanks for these updates. I didn't have any issues running your unit tests before this. However, I took a look at the Maven docs, and I'm wondering if you can specify the number of forks directly in the pom.xml file. 1

@dbw9580
Copy link
Author

dbw9580 commented Aug 16, 2020

Hi @dbw9580
Thanks for these updates. I didn't have any issues running your unit tests before this. However, I took a look at the Maven docs, and I'm wondering if you can specify the number of forks directly in the pom.xml file. 1

Thanks!!

@cgivre
Copy link
Contributor

cgivre commented Aug 17, 2020

@dbw9580
Can you please rebase on current master as well?

@dbw9580
Copy link
Author

dbw9580 commented Aug 17, 2020

@cgivre I think it's based on the current master already.

@cgivre
Copy link
Contributor

cgivre commented Aug 17, 2020

@dbw9580
This is looking pretty good. I'm going to do a final check this evening or tomorrow, but can you please:

  1. Squash all commits and add message of DRILL-7745: Add storage plugin for IPFS as the commit message
  2. Go through and do a final code hygiene check (make sure there are no extra spaces, commented out blocks etc). Drill does have a code formatter1 and just verify that the code complies with the coding standards for spacing and all that. (I didn't see anything jump out at me, but it always helps to double check)
  3. Please create a JIRA to add this to the public documentation. You're welcome to actually add the documentation as well, but for now, let's just make sure we have a JIRA on file to actually add the docs.

Thanks!

@dbw9580
Copy link
Author

dbw9580 commented Aug 17, 2020

@cgivre sure, but I have to do these tomorrow (it's now midnight in my timezone). And maybe allow some time for the IPFS API repo to release an official version: ipfs-shipyard/java-ipfs-http-client#172 (comment) ?

.setState(DrillbitEndpoint.State.ONLINE)
.build();
//DRILL-7777: how to safely remove endpoints that are no longer needed once the query is completed?
ClusterCoordinator.RegistrationHandle handle = coordinator.register(ep);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dbw9580, could you please explain why it is required to register Drillbit endpoint? It is prohibited to do it everywhere except for the place when Drillbit is starting. When the endpoint is registered, it may be misused when executing other queries. Also, the same node may run several group scans, so it will fail for this case because required ports will be used.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to tell Drill that these peers of IPFS who are also running Drill can be used when executing queries distributedly. So these Drillbit endpoints are created on the fly. Maybe limit these dynamically created endpoints to be used by this plugin only?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drill should obtain all the info about the source of the data only from the query plan and minor fragments. Please take a look at existing storage plugins, or even file format plugins to see how it is implemented there.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that unlike the other types of storage plugins, where the Drillbits are known at the time Drill starts, because they reside in the same cluster and are managed by a coordinator, the Drillbits in this plugin are remote IPFS peers which are associated with a particular query, thus can only be known when the user runs a query.

The complete workflow of this plugin is:

  1. the user inputs an SQL statement that specifies an IPFS path to the target table of the query;
  2. this plugin resolves the path to an IPFS object and finds its "providers", i.e. IFPS nodes which store the target object, and filters out those which are running Drill (Drill-ready);
  3. these nodes are registered as Drillbit endpoints, and the query plan is sent to them;
  4. these nodes execute the query plan and return results.

I made some slides to illustrate the basic idea: https://www.slideshare.net/BowenDing4/minerva-ipfs-storage-plugin-for-ipfs, starting on slide 10.

I understand how this storage plugin works may break Drill's existing model, but I couldn't find a plugin that works in a similar way, and the internal workings of Drill is too complex to go through. Could you please be more specific about how this plugin can be incompatible with other queries?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dbw9580,
How would this plugin work if you were joining data from IPFS with data from another storage plugin? Would that break anything?

I'm wondering whether there is some way to mark an endpoint as for IPFS only or even for a particular query only so that it could not be misused and answer @vvysotskyi 's concerns.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dbw9580,
Perhaps try some variations like an aggregate query, or a query that combines multiple data source AND has subqueries. See if you can break it. ;-)

My hunch is that there we can find a way to resolve @vvysotskyi's concerns. This is a new (and interesting) use case, we should find a way to include it.

Copy link
Author

@dbw9580 dbw9580 Aug 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how to produce a testcase that will cause it to fail. I tested successfully in a two node cluster with queries that involves data from the HTTP storage plugin, the classpath plugin and this plugin, combines join, filter and sort operators and nested subqueries. If @vvysotskyi could provide a testcase that shows these dynamically added endpoints can be a problem, I can look into that and see what solutions we can find.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dbw9580, the case I mentioned may be reproduced in the cluster with several nodes when the endpoint for this plugin is registered for the node with no running drillbit that belongs to the current Drill cluster. In this case, the endpoint for that node will be registered. Assume at this time, another query is submitted. Drill will try to send the plan fragment to this node, and this will cause problems since actually Drillbit is not running there.
Please take a look at the logic in BlockMapBuilder, where it decides which drillbit will execute specific CompleteWork instance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the issue with unit tests you have previously observed was the case similar to this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vvysotskyi @cgivre I think to refactor the cluster coordinator and the planner is beyond me. I actually tried to create a endpoint registry that manages them according to the plugin they are registered with. But that didn't work out. The attempt is here: bdchain@85e7d2a
In 0c34b56 I added a manual switch in the config to allow the user to choose whether to run queries in distributed mode. If it is set to false, then no endpoints will be registered on the fly, but the plan will be executed by the foreman, just like the way the HTTP storage plugin works. I hope by doing so (and clarifying in the doc) we can mitigate the problem a bit.

@cgivre
Copy link
Contributor

cgivre commented Aug 20, 2020

@dbw9580
What would happen in the following scenario. Let's say you have user A who executes a query using IPFS and which spins up new Drillbits. User B then decides to execute a query that does not use IPFS. Is it possible that if these two queries are concurrent, could user B's query end up on the Drillbits for IPFS and then either not find data or cause some other problem?

Alternatively, what would happen if user B executes a query while user A's IPFS queries are running. What would happen if user A's query completes before user B? Would it tear down the Drillbits and cause a crash?

I'm asking because I really don't know here..

@dbw9580
Copy link
Author

dbw9580 commented Sep 6, 2020

The test failure looks irrelevant.

@cgivre
Copy link
Contributor

cgivre commented Dec 1, 2020

@dbw9580
Hi there! I hope all is well. Are you still interested in completing this PR?

@dbw9580
Copy link
Author

dbw9580 commented Dec 3, 2020

@dbw9580
Hi there! I hope all is well. Are you still interested in completing this PR?

Yes. I'm currently busy with other projects, and haven't had time to look further into this. I remember we were having some discussions about the way this plugin interacts with the Drill coordinator that needs a major design reconsideration. When I can spare more time on this, I will continue where I left off.

@cgivre
Copy link
Contributor

cgivre commented Apr 26, 2022

@dbw9580 I was looking at this again after someone tagged me in an issue, but I think there is a WAY easier way to get this done. Take a look at this PR that I submitted to add the Dropbox file system to Drill: (#2337). It re-uses a lot of Drill's internals so that you don't have to deal with all the format readers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation PRs relating to documentation enhancement PRs that add a new functionality to Drill new-storage New Storage Plugin
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants