Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable custom Tika Parser #498

Closed
wants to merge 12 commits into from
Closed

Enable custom Tika Parser #498

wants to merge 12 commits into from

Conversation

jevertz
Copy link

@jevertz jevertz commented Feb 2, 2018

This makes it possible to use a custom Tika Parser.
To use the parser, just an addition to _settings.json is necessary.
Here's an example

    "custom_tika_parsers": [
      {
        "class_name": "com.monotype.OTParser",
        "path_to_jar": "/Users/jevertz/workspace/testfscrawler/svn-search/customTikaParsers/target/otparser-0.0.1-SNAPSHOT.jar",
        "mime_types": ["application/x-font-ttf", "application/x-font-otf"]
      }
    ]

The jar file does not need to be on the class path.
The parser needs to be a Tika parser, like described here

@dadoonet
Copy link
Owner

dadoonet commented Feb 2, 2018

Very interesting. I like the idea.

I wonder if it's absolutely needed to provide the tika-config.xml file BTW.

I'll give a proper review later but thanks for sharing your code with this PR!

@jevertz
Copy link
Author

jevertz commented Feb 3, 2018 via email

@dadoonet
Copy link
Owner

Hi @jevertz

Would you mind rebasing your code on latest master branch?

@dadoonet dadoonet added this to the 2.5 milestone Feb 15, 2018
@dadoonet dadoonet added the new For new features or options label Feb 15, 2018
@dadoonet
Copy link
Owner

Also I think it would be good to document that change in the README.

@jevertz
Copy link
Author

jevertz commented Feb 19, 2018 via email

@dadoonet dadoonet self-requested a review February 22, 2018 13:46
@dadoonet dadoonet self-assigned this Feb 22, 2018
Copy link
Owner

@dadoonet dadoonet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks very promising. Thanks!

I added a lot of "format" comments.
In the README, I'm always trying to print a full setting configuration in https://github.com/dadoonet/fscrawler#job-file-specification
May be you could add your changes as well here. IIRC I'm using part of the output of the FsSettingsParserTest execution.

README.md Outdated
It might occur that one or more existing Tika parsers do not provide the intended information, or just do not exist.
This setting allows to use a custom parser instead.
The parsers must be provided as a .jar, but does not need to be on any classpath.
Note that this is an array. Here an example for just one
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT:

Here an example for just one

to

Here an example for just one:

README.md Outdated

Some info about creating a custom parser is available [here](https://tika.apache.org/1.17/parser_guide.html)
Or use a existing parser as a blueprint. Make sure to choose the correct branch.
At the time of this writing fscrawler uses Tika 1.17, while on github the main branch is 2.x.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the main branch is 2.x

to

the main Tika branch is 2.x

README.md Outdated
The parsers from ["branch_1x"](https://github.com/apache/tika/tree/branch_1x/tika-parsers/src/main/java/org/apache/tika/parser) should work fine.

To build the custom parser separately, a pom file can be derived from the tika-parsers [pom.xml](https://github.com/apache/tika/blob/branch_1x/tika-parsers/pom.xml).
Probably a lot can be left out. Here is an example which required fontbox (guess still to long, but worked).
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYM with "guess still to long, but worked"

too long? Is it a comment you are making here? If so I believe this is not needed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to indicate that I didn’t really evaluated the Pom.xml, but stopped when it build without errors.
Should get marked someway as to-do

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to
(The exclusions are copied 1:1 from fscrawler's pom.xml, to be on the safe side)

<groupId>com.uwyn</groupId>
<artifactId>jhighlight</artifactId>
</exclusion>
<!-- ES core already has these -->
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ES Core -> FSCrawler Core ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied the excludes from fscrawler's root pom.xml. The line is still there at the time of this writing. Can it be removed, perhaps?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true. Just remove it here but I will also remove this in the future from the main pom.xml indeed.

README.md Outdated
<version>0.0.1-SNAPSHOT</version>

<properties>
<maven.compiler.source>1.7</maven.compiler.source>
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not an issue but I'd recommend using 1.8 as the version. I know that 1.7 will work but this JVM is no longer supported so no need to advertise it :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You’re completely right. I just didn’t notice the setting.
I’ll change it to 1.8

}

public Builder addTikaCustomParsers(CustomTikaParser customTikaParser) {
if (this.customTikaParsers == null) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this can not normally happen?

@@ -53,6 +54,7 @@
.setUpdateRate(TimeValue.timeValueMinutes(5))
.setIndexContent(true)
.setOcr(OCR_FULL)
.setTikaCustomParsers(new ArrayList<>())
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should add at least a singleton List of one CustomTikaParser element here to make sure serialisation and deserialisation is working well.

Parser customParserDecorated = ParserDecorator.withTypes(customParser, customMediaTypes);
PARSERS[counter] = customParserDecorated;


Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same.




} catch (IOException e) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just do here something like:

} catch (IOException|ClassNotFoundException|InstantiationException|IllegalAccessException e) {
    logger.error("Caught {}: {}", e.getClass().getSimpleName(), e.getMessage());
}

Or just catch any Exception?

logger.error("Caught InstantiationException:" + e.getMessage());
} catch (IllegalAccessException e) {
logger.error("Caught IllegalAccessException:" + e.getMessage());
}/*catch (TikaException te) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably we don't need that?

@jevertz
Copy link
Author

jevertz commented Feb 23, 2018 via email

@dadoonet
Copy link
Owner

Ideally you would create a submodule which is added as a test dependency in the integration tests module.
So we have something that compiles and that is tested.

Then link in the README to the module and the test class.

WDYT?

@jevertz
Copy link
Author

jevertz commented Feb 23, 2018 via email

@dadoonet
Copy link
Owner

Do you perhaps know a starting point to do that when setting up a Tika parser from FSCrawler?

I'm confused. Isn't what you proposed with that:

{
  "name": "test",
  "fs": {
    "custom_tika_parsers": [
      {
        "class_name": "org.me.MyParser",
        "path_to_jar": "/some/full/path/to/myParser-0.0.1-SNAPSHOT.jar",
        "mime_types": ["application/dns", "or-another-mimetype-from-tika"]
      }
    ]
  }
}

I mean that you could create fscrawler-sample-parser.jar which you can add in a test?
If you don't know how to do it, no worries, I can do it in a follow up PR.

@jevertz
Copy link
Author

jevertz commented Feb 26, 2018 via email

# Conflicts:
#	README.md
#	settings/src/main/java/fr/pilato/elasticsearch/crawler/fs/settings/CustomTikaParser.java
#	settings/src/main/java/fr/pilato/elasticsearch/crawler/fs/settings/Fs.java
#	settings/src/test/java/fr/pilato/elasticsearch/crawler/fs/settings/FsSettingsParserTest.java
#	tika/src/main/java/fr/pilato/elasticsearch/crawler/fs/tika/TikaInstance.java
@dadoonet
Copy link
Owner

dadoonet commented Mar 1, 2018

Not sure what is happening with your branch but I can see a lot of changes which are not related to your PR.
May be you could rebase on master and while we are at it squash the changes?

@jevertz
Copy link
Author

jevertz commented Mar 1, 2018 via email

@dadoonet
Copy link
Owner

dadoonet commented Jul 7, 2018

@jevertz Do you think you can come back with an updated PR at some point?

@dadoonet dadoonet removed this from the 2.5 milestone Jul 7, 2018
@dadoonet dadoonet changed the title Enable custom tiki Parser Enable custom Tika Parser Jul 16, 2018
@jevertz
Copy link
Author

jevertz commented Jul 30, 2018

I'll see to do the rebase. Sorry for the delay, I was on holiday.
Probably It's best to include the changes manually again. To which version do you like me to base upon?

@dadoonet
Copy link
Owner

Sorry for the delay, I was on holiday.

Sure. No problem.

Probably It's best to include the changes manually again. To which version do you like me to base upon?

Why not. You can close this PR and open a new one.
It must be done on master branch.

Let me know if you need help.

@notenbrood
Copy link

@jevertz @dadoonet Sorry for the revive, but this feature would actually be pretty interesting to us. Are you still planning on finishing this PR?

@janhoy
Copy link
Contributor

janhoy commented Sep 6, 2020

See #1004 for a related effort, which will let users take full control of what happens with a file before it gets indexed.
It could help solve this usecase, perhaps not as elegant by config but by giving you a hook to plug in your Java code.

@dadoonet dadoonet removed the new For new features or options label May 10, 2022
@dadoonet
Copy link
Owner

This is now supported I think with #1367

@dadoonet dadoonet closed this May 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants