Processing pipeline support #1004
base: master
Add some documentation (not tested)
Add support for an 'extraDoc' on context that gets merged into 'doc' before indexing. Add some documentation (not tested).
…ttings and esClient Also pass this to processors as needed. This makes the Context object only keep data related to the file being processed
This approach can also help solve #858 more flexibly, by implementing your own |
…er to subclass the `DefaultProcessingPipeline`, calling these two methods plus your own in addition. Add some javadoc
@dadoonet Do you have any idea why Travis fails here? It succeeds locally.
I'm probably not going to continue on this PR, as we may not use fscrawler after all, but feel free to use it if you like!
Apparently this is because of this line:
I believe that the pipeline is
I think it's a very nice idea and addition. So I'll think about it when I'm done with #991 :)
I looked at the code today (finally!) and it looks really good. Very smart implementation. I like it a lot. I'm going to see how to adapt your branch to all the recent changes that happened in the meantime. Thanks a lot @janhoy for your work on this!
Is there interest in reviving this? We have a scenario where we need to modify the |
This PR fixes #1003 by letting users bring their own complex document processing.
Instead of trying to make a fancy configurable pipeline, it is much simpler to just let users provide a jar file with their custom `ProcessingPipeline` implementation. By default the `DefaultProcessingPipeline` is used, which simply parses with Tika and indexes.
A custom pipeline may e.g. do
Doc
extraDoc
on the context, that will get merged into the primary document right before indexing.
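To make the idea concrete, here is a minimal, self-contained sketch of what a custom pipeline could look like. It is not the code from this PR: the `Context`, `ProcessingPipeline` and `DefaultProcessingPipeline` types below are simplified stand-ins written for illustration, the method names `processFile`, `parse`, `index` and `enrich` are assumptions, and an in-memory map takes the place of Tika and Elasticsearch. It only illustrates the shape of the approach: subclass the default pipeline, run its two steps plus your own, and put extra fields in `extraDoc` so they get merged into the document right before indexing.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the per-file context; the real class in the PR differs.
class Context {
    final String filename;
    final Map<String, Object> doc = new HashMap<>();      // primary document
    final Map<String, Object> extraDoc = new HashMap<>(); // merged into doc before indexing

    Context(String filename) {
        this.filename = filename;
    }
}

// Hypothetical pipeline contract: process one file end to end.
interface ProcessingPipeline {
    void processFile(Context ctx);
}

// Simplified default pipeline: "parse" then "index".
// Parsing and indexing are faked here; the real ones use Tika and Elasticsearch.
class DefaultProcessingPipeline implements ProcessingPipeline {
    // In-memory map standing in for the search index (assumption for this sketch).
    final Map<String, Map<String, Object>> index = new HashMap<>();

    @Override
    public void processFile(Context ctx) {
        parse(ctx);
        index(ctx);
    }

    protected void parse(Context ctx) {
        ctx.doc.put("content", "parsed text of " + ctx.filename);
    }

    protected void index(Context ctx) {
        // Merge extraDoc into the primary document right before indexing.
        Map<String, Object> merged = new HashMap<>(ctx.doc);
        merged.putAll(ctx.extraDoc);
        index.put(ctx.filename, merged);
    }
}

// Custom pipeline: reuse the two default steps and add an enrichment step in between.
class EnrichingPipeline extends DefaultProcessingPipeline {
    @Override
    public void processFile(Context ctx) {
        parse(ctx);
        enrich(ctx);
        index(ctx);
    }

    protected void enrich(Context ctx) {
        // Example enrichment: derive the file extension from the file name.
        String ext = ctx.filename.substring(ctx.filename.lastIndexOf('.') + 1);
        ctx.extraDoc.put("extension", ext);
    }
}
```

In this sketch, calling `new EnrichingPipeline().processFile(new Context("report.pdf"))` stores a single merged document containing both the parsed `content` field and the added `extension` field. Keeping the enrichment in `extraDoc` instead of writing to `doc` directly means the custom step never has to know how the default parser populates the primary document.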