Enable custom Tika Parser #498
Very interesting. I like the idea. I wonder if it's absolutely needed to provide the [...] I'll give a proper review later, but thanks for sharing your code with this PR!
You’re welcome.
Actually, I first implemented passing a Tika config, which can be useful, but not for my use case.
So I took a different approach and later removed the need for a Tika config file.
Best
Jochen Evertz
|
Hi @jevertz, would you mind rebasing your code on the latest master branch?
Also, I think it would be good to document that change in the README.
Hi David,
It might take a little while since the ES version changed. I don’t want to break my setup.
But not weeks.
Best
Jochen
|
That looks very promising. Thanks!
I added a lot of "format" comments.
In the README, I always try to print a full settings configuration in https://github.com/dadoonet/fscrawler#job-file-specification
Maybe you could add your changes there as well. IIRC I'm using part of the output of the FsSettingsParserTest execution.
README.md
Outdated
It might occur that one or more existing Tika parsers do not provide the intended information, or just do not exist.
This setting allows to use a custom parser instead.
The parsers must be provided as a .jar, but does not need to be on any classpath.
Note that this is an array. Here an example for just one
NIT:
Here an example for just one
to
Here an example for just one:
README.md
Outdated
Some info about creating a custom parser is available [here](https://tika.apache.org/1.17/parser_guide.html)
Or use a existing parser as a blueprint. Make sure to choose the correct branch.
At the time of this writing fscrawler uses Tika 1.17, while on github the main branch is 2.x.
the main branch is 2.x
to
the main Tika branch is 2.x
README.md
Outdated
The parsers from ["branch_1x"](https://github.com/apache/tika/tree/branch_1x/tika-parsers/src/main/java/org/apache/tika/parser) should work fine.

To build the custom parser separately, a pom file can be derived from the tika-parsers [pom.xml](https://github.com/apache/tika/blob/branch_1x/tika-parsers/pom.xml).
Probably a lot can be left out. Here is an example which required fontbox (guess still to long, but worked).
WDYM with "guess still to long, but worked"?
Do you mean "too long"? Is it a comment you are making here? If so, I believe it is not needed.
Just to indicate that I didn’t really evaluate the pom.xml, but stopped when it built without errors.
It should get marked as a to-do in some way.
Changed it to
(The exclusions are copied 1:1 from fscrawler's pom.xml, to be on the safe side)
<groupId>com.uwyn</groupId>
<artifactId>jhighlight</artifactId>
</exclusion>
<!-- ES core already has these -->
ES Core -> FSCrawler Core ?
I copied the excludes from fscrawler's root pom.xml. The line is still there at the time of this writing. Can it be removed, perhaps?
That's true. Just remove it here; I will also remove it from the main pom.xml in the future.
README.md
Outdated
<version>0.0.1-SNAPSHOT</version>

<properties>
<maven.compiler.source>1.7</maven.compiler.source>
Not an issue but I'd recommend using 1.8 as the version. I know that 1.7 will work but this JVM is no longer supported so no need to advertise it :)
You’re completely right. I just didn’t notice the setting.
I’ll change it to 1.8
}

public Builder addTikaCustomParsers(CustomTikaParser customTikaParser) {
    if (this.customTikaParsers == null) {
I believe this cannot normally happen?
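The null check questioned here becomes unnecessary if the builder creates its list eagerly. A minimal sketch of that idea, assuming a simplified stand-in for the PR's `CustomTikaParser`; the `FsBuilderSketch` class, the single `className` field, and the `build()` return type are hypothetical and not FSCrawler's actual code:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class FsBuilderSketch {

    /** Hypothetical stand-in for the PR's CustomTikaParser settings class. */
    public static final class CustomTikaParser {
        private final String className;

        public CustomTikaParser(String className) {
            this.className = className;
        }

        public String getClassName() {
            return className;
        }
    }

    public static final class Builder {
        // Created eagerly, so the add method never has to check for null.
        private final List<CustomTikaParser> customTikaParsers = new ArrayList<>();

        public Builder addTikaCustomParsers(CustomTikaParser customTikaParser) {
            this.customTikaParsers.add(customTikaParser);
            return this;
        }

        /** Returns a read-only snapshot; a real builder would return the settings object. */
        public List<CustomTikaParser> build() {
            return Collections.unmodifiableList(new ArrayList<>(customTikaParsers));
        }
    }

    public static void main(String[] args) {
        List<CustomTikaParser> parsers = new Builder()
                .addTikaCustomParsers(new CustomTikaParser("org.me.MyParser"))
                .build();
        System.out.println(parsers.get(0).getClassName());
    }
}
```

If the builder also exposes a setter that accepts a caller-supplied list, that setter would still need to guard against null, so whether the check can go depends on the rest of the class.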
@@ -53,6 +54,7 @@
.setUpdateRate(TimeValue.timeValueMinutes(5))
.setIndexContent(true)
.setOcr(OCR_FULL)
.setTikaCustomParsers(new ArrayList<>())
I think you should add at least a singleton List of one CustomTikaParser element here to make sure serialisation and deserialisation are working well.
Parser customParserDecorated = ParserDecorator.withTypes(customParser, customMediaTypes);
PARSERS[counter] = customParserDecorated;
Same.
} catch (IOException e) {
I think you can just do something like this here:
} catch (IOException|ClassNotFoundException|InstantiationException|IllegalAccessException e) {
    logger.error("Caught {}: {}", e.getClass().getSimpleName(), e.getMessage());
}
Or just catch any Exception?
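For context, the multi-catch suggestion could look roughly like this in a self-contained form. This is only a sketch: `CustomParserLoader` and `load` are hypothetical names, it prints to `System.err` instead of using the project's logger, and it catches `ReflectiveOperationException` (the common superclass of `ClassNotFoundException`, `InstantiationException` and `IllegalAccessException`) rather than listing each one:

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Paths;
import java.util.Optional;

public class CustomParserLoader {

    /**
     * Loads className from the jar at pathToJar and instantiates it through
     * its no-argument constructor. Returns an empty Optional on failure.
     */
    public static Optional<Object> load(String pathToJar, String className) {
        try {
            URL jarUrl = Paths.get(pathToJar).toUri().toURL();
            // Deliberately left open: classes the instance needs later are
            // resolved lazily through this loader.
            URLClassLoader loader = new URLClassLoader(
                    new URL[] {jarUrl}, CustomParserLoader.class.getClassLoader());
            Class<?> clazz = loader.loadClass(className);
            return Optional.of(clazz.getDeclaredConstructor().newInstance());
        } catch (IOException | ReflectiveOperationException e) {
            // One handler for all failure modes, as suggested in the review.
            System.err.printf("Caught %s: %s%n", e.getClass().getSimpleName(), e.getMessage());
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Hypothetical path and class name, taken from the settings example in this thread.
        Optional<Object> parser = load("/some/full/path/to/myParser-0.0.1-SNAPSHOT.jar", "org.me.MyParser");
        System.out.println(parser.isPresent() ? "parser loaded" : "parser not available");
    }
}
```

Catching a plain `Exception` would be shorter still, but it would also swallow unchecked exceptions such as an invalid path, which is why the sketch keeps the narrower multi-catch.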
logger.error("Caught InstantiationException:" + e.getMessage());
} catch (IllegalAccessException e) {
logger.error("Caught IllegalAccessException:" + e.getMessage());
}/*catch (TikaException te) {
Probably we don't need that?
You’re welcome!
Sorry for the delay. I’ll merge your corrections ASAP, which probably means next Monday.
I wasn’t sure whether it’s a good idea to include it in a full example, since an error will be thrown if the custom jar doesn’t exist.
Or do you think it is useful to add a complete custom parser?
I didn’t do it since I thought it might confuse most users.
|
Ideally you would create a submodule which is added as a test dependency in the integration tests module. Then link in the README to the module and the test class. WDYT?
OK, I’ll see to it.
It might be clearer to use an example with a custom extension instead of a mime type.
I have to look into that. Do you perhaps know a starting point for doing that when setting up a Tika parser from FSCrawler?
|
I'm confused. Isn't that what you proposed with this:
{
  "name": "test",
  "fs": {
    "custom_tika_parsers": [
      {
        "class_name": "org.me.MyParser",
        "path_to_jar": "/some/full/path/to/myParser-0.0.1-SNAPSHOT.jar",
        "mime_types": ["application/dns", "or-another-mimetype-from-tika"]
      }
    ]
  }
}
I mean that you could create
Sorry for causing this confusion; it's all my fault. I expressed that badly.
At the moment I use the mime_types. However, there are files which do not have a mime type, only a custom extension.
That means I need to pass a custom mime type to Tika.
I thought you might have run into something like that before, so I dared to ask.
I’ll just see how to accomplish it. The Tika documentation (https://tika.apache.org/1.17/configuring.html) is a little brief in this regard:
Configuring Mime Types
TODO Mention non-standard paths, and custom mime type files
So I dared to ask whether you might have encountered this and could give me a hint where to look.
Probably the decorators, but I’ll find out.
Best
Jochen
|
# Conflicts:
#	README.md
#	settings/src/main/java/fr/pilato/elasticsearch/crawler/fs/settings/CustomTikaParser.java
#	settings/src/main/java/fr/pilato/elasticsearch/crawler/fs/settings/Fs.java
#	settings/src/test/java/fr/pilato/elasticsearch/crawler/fs/settings/FsSettingsParserTest.java
#	tika/src/main/java/fr/pilato/elasticsearch/crawler/fs/tika/TikaInstance.java
Not sure what is happening with your branch, but I can see a lot of changes which are not related to your PR.
Sorry, I messed up the rebase.
I’ll see to correcting it.
We used Perforce before, so I can only apologize for making noob git mistakes.
|
@jevertz Do you think you can come back with an updated PR at some point?
I'll see to the rebase. Sorry for the delay, I was on holiday.
Sure. No problem.
Why not. You can close this PR and open a new one. Let me know if you need help.
See #1004 for a related effort, which will let users take full control of what happens with a file before it gets indexed.
I think this is now supported with #1367.
This makes it possible to use a custom Tika Parser.
To use the parser, only an addition to _settings.json is necessary.
Here's an example
The jar file does not need to be on the classpath.
The parser needs to be a Tika parser, as described here