Processing pipeline support #1004

Draft
wants to merge 18 commits into base: master

janhoy
Contributor

@janhoy janhoy commented Sep 4, 2020

This PR fixes #1003 by letting users bring their own complex document processing.
Instead of trying to build a fancy configurable pipeline, it is much simpler to let users provide a jar file with their own ProcessingPipeline implementation. By default, the DefaultProcessingPipeline is used, which simply parses with Tika and indexes.

A custom pipeline may, for example:

  • Parse with Tika, but with custom configuration, metadata mapping, custom parsers, etc.
  • Clean and sanitize the resulting Doc
  • Enrich the ES document with further content, e.g. from an external source, by adding an extraDoc to the context that is merged into the primary document right before indexing.
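
The three steps listed above could be sketched roughly as follows. This is only an illustration: the Context fields, the processFile method name, and the stage methods are assumptions made for the example and are not taken from the PR's code, and the parse step is a placeholder where a real pipeline would call Tika.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the PR's context object; the real fields may differ.
class Context {
    final String rawContent;                                     // file content, already read
    final Map<String, Object> doc = new LinkedHashMap<>();       // primary document
    final Map<String, Object> extraDoc = new LinkedHashMap<>();  // merged into doc before indexing

    Context(String rawContent) {
        this.rawContent = rawContent;
    }
}

// Hypothetical interface shape; the PR's actual method signature is not shown in this thread.
interface ProcessingPipeline {
    void processFile(Context ctx);
}

class CustomProcessingPipeline implements ProcessingPipeline {
    @Override
    public void processFile(Context ctx) {
        parse(ctx);
        sanitize(ctx);
        enrich(ctx);
        mergeExtraDoc(ctx);
    }

    // Placeholder for the parsing step (a real pipeline would call Tika here).
    void parse(Context ctx) {
        ctx.doc.put("content", ctx.rawContent);
    }

    // Collapse runs of whitespace and trim the extracted text.
    void sanitize(Context ctx) {
        String content = (String) ctx.doc.get("content");
        ctx.doc.put("content", content.replaceAll("\\s+", " ").trim());
    }

    // Add fields from an (imaginary) external source to the extra document.
    void enrich(Context ctx) {
        ctx.extraDoc.put("source_system", "example");
    }

    // Assumption: on key collisions the extra document wins over the primary one.
    void mergeExtraDoc(Context ctx) {
        ctx.doc.putAll(ctx.extraDoc);
    }
}
```

Keeping the enrichment in a separate extraDoc until the last step means the parser output is never mixed with externally sourced fields until just before indexing.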

Add some documentation (not tested)
Add support for an 'extraDoc' on context that gets merged into 'doc' before indexing
Add some documentation (not tested)
…ttings and esClient

Also pass this to processors as needed. This makes the Context object only keep data related to the file being processed
@janhoy janhoy mentioned this pull request Sep 6, 2020
@janhoy
Contributor Author

janhoy commented Sep 6, 2020

This approach can also help solve #858 more flexibly: you implement your own CustomProcessingPipeline, which replaces the call to TikaProcessor with a call to your own PlainTextExtractor. I think Tika is now pretty good at detecting plain text, so normally it should do the job out of the box, but this solution gives you all the flexibility you need through a plugin, with no changes to core code.
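
As a rough sketch of that idea, the extraction step can be made swappable so that a plain-text extractor takes the place of the Tika-based one. The class shapes below are assumptions for illustration only. PlainTextExtractor and CustomProcessingPipeline are the names used in the comment above, but their real signatures are not shown in this thread, and the UTF-8 decoding is an assumed default.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Hypothetical extractor: turns raw file bytes into text without calling Tika.
class PlainTextExtractor implements Function<byte[], String> {
    @Override
    public String apply(byte[] raw) {
        // Assumption: the files are UTF-8 encoded.
        return new String(raw, StandardCharsets.UTF_8);
    }
}

// Hypothetical pipeline whose extraction step is injected, so it can be replaced.
class CustomProcessingPipeline {
    private final Function<byte[], String> extractor;

    CustomProcessingPipeline(Function<byte[], String> extractor) {
        this.extractor = extractor;
    }

    String process(byte[] raw) {
        return extractor.apply(raw);
    }
}
```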

@janhoy janhoy marked this pull request as draft September 6, 2020 00:40
@janhoy
Contributor Author

janhoy commented Sep 6, 2020

@dadoonet Do you have any idea why travis fails here? It succeeds locally.

@janhoy
Contributor Author

janhoy commented Sep 7, 2020

I'm probably not going to continue on this PR, as we may not use fscrawler after all, but feel free to use it if you like!

@dadoonet
Owner

dadoonet commented Nov 10, 2020

> @dadoonet Do you have any idea why travis fails here? It succeeds locally.

Apparently this is because of this line:

throw new RuntimeException("Could not create processing pipeline " + fsSettings.getFs().getPipeline().getClassName() + ". giving up");

I believe that the pipeline is null?
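
If that diagnosis is right, one possible guard is to fall back to the default pipeline when no class name is configured, and only use reflection when one is set. The sketch below is hypothetical: PipelineLoader, its load method, and the name() method are invented for the example, and only the error message mirrors the line quoted above.

```java
// Hypothetical interface; the name() method exists only to make the sketch testable.
interface ProcessingPipeline {
    String name();
}

class DefaultProcessingPipeline implements ProcessingPipeline {
    @Override
    public String name() {
        return "default";
    }
}

// Hypothetical helper, not part of the PR.
class PipelineLoader {
    static ProcessingPipeline load(String className) {
        // No pipeline configured: use the default instead of dereferencing null.
        if (className == null || className.trim().isEmpty()) {
            return new DefaultProcessingPipeline();
        }
        try {
            Class<?> clazz = Class.forName(className);
            return (ProcessingPipeline) clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new RuntimeException(
                    "Could not create processing pipeline " + className + ". giving up", e);
        }
    }
}
```

Passing the class name as a plain string keeps the null check in one place, so the error message can never trigger a second NullPointerException while it is being built.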

> I'm probably not going to continue on this PR, as we may not use fscrawler after all, but feel free to use it if you like!

I think it's a very nice idea and addition. So I'll think about it when I'm done with #991 :)

@dadoonet dadoonet self-requested a review July 20, 2021 09:45
@dadoonet dadoonet self-assigned this Jul 20, 2021
@dadoonet dadoonet added the new For new features or options label Jul 20, 2021
@dadoonet
Owner

I looked at the code today (finally!) and it looks really good. Very smart implementation. I like it a lot.
That would also give some flexibility for parsing XML or plain JSON documents.

I'm going to see how to adapt your branch to all the recent changes that happened in the meantime.

Thanks a lot @janhoy for your work on this!

@smndtrl

smndtrl commented Mar 1, 2023

Is there interest in reviving this?

We have a scenario where we need to modify the InputStream to strip certain things before passing it to the TikaParser and this seems like a nice way to keep our NDA for the proprietary method we have to use.
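
A minimal sketch of that kind of pre-processing, assuming the stripping can be expressed as dropping a single unwanted byte value: wrap the InputStream in a java.io.FilterInputStream before it is handed to the parser. StrippingInputStream and its demo helper are hypothetical names for this example. A real filter would implement whatever proprietary logic is needed, and the byte-at-a-time bulk read is kept for clarity, not speed.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical filter: removes every occurrence of one byte value from the stream.
class StrippingInputStream extends FilterInputStream {
    private final int unwanted;

    StrippingInputStream(InputStream in, int unwanted) {
        super(in);
        this.unwanted = unwanted;
    }

    @Override
    public int read() throws IOException {
        int b;
        do {
            b = super.read();      // -1 at end of stream never equals a byte value
        } while (b == unwanted);
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int count = 0;
        while (count < len) {
            int b = read();        // reuse the filtering single-byte read
            if (b == -1) {
                break;
            }
            buf[off + count++] = (byte) b;
        }
        return count == 0 ? -1 : count;
    }

    // Small demonstration helper: filter an ASCII string through the stream.
    static String demo(String text, char unwanted) {
        byte[] raw = text.getBytes(StandardCharsets.US_ASCII);
        try (InputStream in = new StrippingInputStream(new ByteArrayInputStream(raw), unwanted)) {
            return new String(in.readAllBytes(), StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a custom pipeline, the wrapped stream would be passed to the parser in place of the original one, so the core code never sees the unfiltered bytes.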

Successfully merging this pull request may close these issues.

Provide hooks for custom processing
3 participants