Processing pipeline support #1004

janhoy · 2020-09-04T22:23:49Z

This PR fixes #1003 by letting users bring their own complex document processing.
Instead of trying to make a fancy configurable pipeline, it is much simpler to just let users provide a jar file with their custom ProcessingPipeline implementation. By default the DefaultProcessingPipeline is used, which simply parses with Tika and indexes the result.

A custom pipeline may, for example:

- Parse with Tika, but with custom configuration, metadata mapping, custom parsers, etc.
- Clean and sanitize the resulting Doc
- Enrich the ES document with further content, e.g. from an external source, by adding an extraDoc on the context, which gets merged into the primary document right before indexing.
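
As a rough illustration of the three steps listed above (not the code in this PR), the sketch below fakes the parse step, cleans the content, and merges an extra document into the primary one. `SketchContext`, `SketchPipeline` and `EnrichingPipeline` are hypothetical names, and letting `extraDoc` win on key collisions is an assumption, not something the description specifies.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the per-file context described in this PR.
class SketchContext {
    final Map<String, Object> doc = new LinkedHashMap<>();      // primary document
    final Map<String, Object> extraDoc = new LinkedHashMap<>(); // optional enrichment
}

// Hypothetical pipeline contract: one call per crawled file.
interface SketchPipeline {
    void processFile(SketchContext ctx);
}

class EnrichingPipeline implements SketchPipeline {
    @Override
    public void processFile(SketchContext ctx) {
        // Step 1: parsing is faked here; the real project would call Tika.
        ctx.doc.put("content", "  Hello   world ");

        // Step 2: clean and sanitize the parsed content.
        ctx.doc.computeIfPresent("content",
                (key, value) -> value.toString().trim().replaceAll("\\s+", " "));

        // Step 3: enrich, e.g. with data looked up in an external source.
        ctx.extraDoc.put("department", "legal");

        // Merge right before indexing (indexing itself is left out).
        merge(ctx);
    }

    // Assumption: on a key collision the value from extraDoc wins.
    static void merge(SketchContext ctx) {
        ctx.doc.putAll(ctx.extraDoc);
    }
}

class EnrichingPipelineDemo {
    public static void main(String[] args) {
        SketchContext ctx = new SketchContext();
        new EnrichingPipeline().processFile(ctx);
        System.out.println(ctx.doc);
    }
}
```

Keeping the enrichment in a separate map means the parse step never has to know about it; the merge is the only place where the two meet.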

janhoy · 2020-09-06T00:27:49Z

This approach can also help solve #858 more flexibly, by implementing your own CustomProcessingPipeline which would replace the call to TikaProcessor with a call to your own PlainTextExtractor. I think Tika is now pretty good at detecting plaintext, so normally it should do the job out of the box, but this solution gives all the flexibility you need with a plugin and no changes to core code.
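
A minimal sketch of that idea, assuming the default pipeline exposes extraction and indexing as separate overridable steps. `BasePipelineSketch` and `PlainTextPipelineSketch` are hypothetical stand-ins, not the classes in this branch, and the map named `indexed` only simulates Elasticsearch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the default pipeline: extract, then index.
class BasePipelineSketch {
    final Map<String, Object> indexed = new LinkedHashMap<>(); // simulates the index

    public void processFile(String filename, InputStream in) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("filename", filename);
        try {
            extract(in, doc);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        index(doc);
    }

    // The real default would hand the stream to Tika here.
    protected void extract(InputStream in, Map<String, Object> doc) throws IOException {
        doc.put("content", "<parsed by the default extractor>");
    }

    protected void index(Map<String, Object> doc) {
        indexed.putAll(doc);
    }
}

// Custom pipeline: only the extraction step is replaced.
class PlainTextPipelineSketch extends BasePipelineSketch {
    @Override
    protected void extract(InputStream in, Map<String, Object> doc) throws IOException {
        // Read the file as UTF-8 text, bypassing any format detection.
        doc.put("content", new String(in.readAllBytes(), StandardCharsets.UTF_8));
    }
}

class PlainTextPipelineDemo {
    public static void main(String[] args) {
        PlainTextPipelineSketch pipeline = new PlainTextPipelineSketch();
        byte[] bytes = "plain text".getBytes(StandardCharsets.UTF_8);
        pipeline.processFile("notes.txt", new ByteArrayInputStream(bytes));
        System.out.println(pipeline.indexed);
    }
}
```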

janhoy · 2020-09-06T16:33:29Z

@dadoonet Do you have any idea why travis fails here? It succeeds locally.

janhoy · 2020-09-07T18:53:59Z

I'm probably not going to continue on this PR, as we may not use fscrawler after all, but feel free to use it if you like!

dadoonet · 2020-11-10T17:08:49Z

> @dadoonet Do you have any idea why travis fails here? It succeeds locally.

Apparently this is because of this line:

throw new RuntimeException("Could not create processing pipeline " + fsSettings.getFs().getPipeline().getClassName() + ". giving up");

I believe that the pipeline is null?
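
If that guess is right, the error message itself would dereference a null pipeline setting. One way to guard against it is sketched below, assuming the pipeline is created reflectively from a configured class name. `PipelineSettingsSketch`, `LoadablePipeline`, `FallbackPipeline` and `PipelineLoaderSketch` are hypothetical names; only the exception message mirrors the line quoted above.

```java
// Hypothetical settings holder; stands in for the configured pipeline section.
class PipelineSettingsSketch {
    private final String className;

    PipelineSettingsSketch(String className) {
        this.className = className;
    }

    String getClassName() {
        return className;
    }
}

// Hypothetical pipeline contract, reduced to one method for the sketch.
interface LoadablePipeline {
    String name();
}

// Stands in for the default pipeline used when nothing is configured.
class FallbackPipeline implements LoadablePipeline {
    @Override
    public String name() {
        return "default";
    }
}

class PipelineLoaderSketch {
    static LoadablePipeline load(PipelineSettingsSketch settings) {
        // Guard first: no pipeline section, or no class name, means "use the default".
        if (settings == null || settings.getClassName() == null) {
            return new FallbackPipeline();
        }
        String className = settings.getClassName();
        try {
            Class<?> clazz = Class.forName(className);
            return (LoadablePipeline) clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            // className is a local variable, so building this message cannot hit a null setting.
            throw new IllegalStateException(
                    "Could not create processing pipeline " + className + ". Giving up", e);
        }
    }
}

class PipelineLoaderDemo {
    public static void main(String[] args) {
        System.out.println(PipelineLoaderSketch.load(null).name());
    }
}
```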

> I'm probably not going to continue on this PR, as we may not use fscrawler after all, but feel free to use it if you like!

I think it's a very nice idea and addition. So I'll think about it when I'm done with #991 :)

dadoonet · 2021-07-20T09:48:57Z

I looked at the code today (finally!) and it looks really good. Very smart implementation. I like it a lot.
That would also give some flexibility for parsing XML or plain JSON documents.

I'm going to see how to adapt your branch to all the recent changes that happened in the meantime.

Thanks a lot @janhoy for your work on this!

smndtrl · 2023-03-01T10:43:01Z

Is there interest in reviving this?

We have a scenario where we need to modify the InputStream to strip certain things before passing it to the TikaParser and this seems like a nice way to keep our NDA for the proprietary method we have to use.
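
For illustration only, a custom pipeline could wrap the incoming stream in a filter before handing it to the parser. The sketch below drops one byte value (NUL) as a placeholder for whatever really needs stripping; `StrippingInputStream` and `StripDemo` are hypothetical names, and the actual proprietary logic would replace the comparison in `read()`.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Objects;

// Filters out every occurrence of one byte value from the wrapped stream.
class StrippingInputStream extends FilterInputStream {
    private final int unwanted; // byte value to drop, in the range 0..255

    StrippingInputStream(InputStream in, int unwanted) {
        super(in);
        this.unwanted = unwanted;
    }

    @Override
    public int read() throws IOException {
        int b;
        do {
            b = super.read(); // -1 signals end of stream and never equals 'unwanted'
        } while (b == unwanted);
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        Objects.checkFromIndexSize(off, len, buf.length);
        if (len == 0) {
            return 0;
        }
        int n = 0;
        while (n < len) {
            int b = read(); // reuse the filtering single-byte read
            if (b == -1) {
                break;
            }
            buf[off + n++] = (byte) b;
        }
        return n == 0 ? -1 : n;
    }

    @Override
    public boolean markSupported() {
        return false; // keep the sketch simple: no mark/reset
    }
}

class StripDemo {
    // Reads 'raw' through the filter and decodes the remainder as UTF-8.
    static String readStripped(byte[] raw, int unwanted) {
        try (InputStream in = new StrippingInputStream(new ByteArrayInputStream(raw), unwanted)) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readStripped(new byte[] {'a', 0, 'b', 0, 0, 'c'}, 0));
    }
}
```

Because the filter is an ordinary `InputStream`, whichever parser receives it needs no changes, which is what would let the stripping logic live entirely in the private jar.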

janhoy added 9 commits September 4, 2020 22:48

Processing pipeline support (dadoonet#1003).

7c53731

Move config into fs and actually use it

72aa0b7

Add some documentation (not tested)

Move config into fs and actually use it

a451ce1

Add support for an 'extraDoc' on context that gets merged into 'doc' before indexing Add some documentation (not tested)

Address some automated PR review feedback

0a37d83

Fix Codacy issues

225f348

More fixes

8a57fb7

More fixes

0b54155

Avoid nullpointer when pipeline not defined

1dc0912

Introduce a init() method that takes a config with settings like fsSe…

430de8b

…ttings and esClient Also pass this to processors as needed. This makes the Context object only keep data related to the file being processed

janhoy mentioned this pull request Sep 6, 2020

Enable custom Tika Parser #498

Closed

Remove empty constructor, add debug msg when a file enters the pipeline

a7f8bf5

janhoy marked this pull request as draft September 6, 2020 00:40

janhoy added 5 commits September 6, 2020 16:16

Tikaprocessor should not close stream, it is done elsewhere

601112b

Extract Tika and indexing into separate methods, to make it even easi…

4b779d6

…er to subclass the `DefaultProcessingPipeline`, calling these two methods plus your own in addition. Add some javadoc

Revert some unnecessary changes, try to please codacy

c2d5e31

Remove unused import

d3c0272

Remove unused import

a4ab68a

janhoy added 3 commits September 7, 2020 16:37

TikaProcessor configurable for what metadata fields to extract

ca37bd3

Merge branch 'master' into pipeline

ef93e99

Throw exception if pipeline is not found

fe03ca8

dadoonet self-requested a review July 20, 2021 09:45

dadoonet self-assigned this Jul 20, 2021

dadoonet added the new For new features or options label Jul 20, 2021

smndtrl mentioned this pull request Mar 2, 2023

Add processing pipeline support #1619

Closed
