-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow conversion without image #26
base: master
Are you sure you want to change the base?
Conversation
- add convert_file_without_image function, which can convert AWS-Textract-JSON to PAGE-XML without reading the original input image; this function takes the image’s dimensions as inputs, instead of reading these from the image - add optional “image-width” and “image-height” options to the CLI command, which triggers the use of the new function
@click.argument("image-file", type=click.Path(dir_okay=False, exists=True)) | ||
def cli(output_file, aws_json_file, image_file): | ||
@click.argument("image-file", type=str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please note that with this change to the image-file
argument, a path that does not exist will not be caught here. I'd welcome ideas about the best way to preserve path checking while allowing the user to supply a filename when the image is not present. Perhaps a 2nd parameter would be the best way, perhaps in the form of a dedicated command that doesn't mess with the original one? Or is this ok, and we can throw an error in the convert_file
function if the referenced file can't be found? (Again, please forgive my lack of Python skills and familiarity with Python idioms.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, interdependencies are not easily covered by Click. (It's doable via Cloup, but not necessary.)
I would rather keep the type
, but merely set exists=False
to skip verification of the path, and then cover all combinations in a simple assertion, see below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many thanks @joewiz, much appreciated!
Your implementation is already near perfect IMO – see below for some cosmetic requests. (If you don't want to work them in, let me know and I'll do it myself.)
Requires the absolute dimensions of the original input image used for AWS OCR. | ||
(If the image is available, call convert_file instead.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requires the absolute dimensions of the original input image used for AWS OCR. | |
(If the image is available, call convert_file instead.) | |
Also requires the original input image used for AWS OCR, but only to reference it under | |
`Page/@imageFilename` with its full path – does not actually require an existing file under that path. | |
Instead, this additionally requires the absolute dimensions (pixel resolution) of the image. |
@click.argument("image-file", type=click.Path(dir_okay=False, exists=True)) | ||
def cli(output_file, aws_json_file, image_file): | ||
@click.argument("image-file", type=str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, interdependencies are not easily covered by Click. (It's doable via Cloup, but not necessary.)
I would rather keep the type
, but merely set exists=False
to skip verification of the path, and then cover all combinations in a simple assertion, see below.
@click.argument("image-file", type=str) | ||
@click.option("--image-width", type=int, help="Width of the image in pixels, if the image isn't available.") | ||
@click.option("--image-height", type=int, help="Height of the image in pixels, if the image isn't available.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@click.argument("image-file", type=str) | |
@click.option("--image-width", type=int, help="Width of the image in pixels, if the image isn't available.") | |
@click.option("--image-height", type=int, help="Height of the image in pixels, if the image isn't available.") | |
@click.argument("image-file", type=click.Path(dir_okay=False, exists=False)) | |
@click.option("--image-width", type=int, help="width of the image in pixels (to avoid opening the image file)") | |
@click.option("--image-height", type=int, help="height of the image in pixels (to avoid opening the image file)") |
The output file will reference the image file under `Page/@imageFilename` | ||
with its full path. (So you may want to use a relative path.) | ||
The output file will reference the image file using the name you provide. | ||
(So you may want to use a relative path.) | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
""" | |
""" | |
assert image_width and image_height or image_file and exists(image_file),\ | |
"requires passing either an existing image file path or --image-width and --image-height" |
(and also at the top of the file: from os.path import exists
)
convert_file_without_image(json_path, img_path, img_width, img_height, out_path) | ||
|
||
|
||
def convert_file_without_image(json_path: str, img_path: str, img_width: int, img_height: int, out_path: str) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: this function should also be exported (i.e. added to the list in __init__.py
's __all__
)
Strange! Sounds like your |
As proposed in #25, this PR adds a
convert_file_without_image
function, which can convert AWS-Textract-JSON to PAGE-XML without needing to read the original input image. This function takes the image’s dimensions as inputs, instead of reading these from the image.The PR also adds “image-width” and “image-height” options to the CLI command, which triggers the use of the new function if they are supplied.
I lack python skills and cobbled this together, so pardon the lack of tests covering the new functionality.
I did run
make test-api
, which passed. (Butmake test-cli
failed with the same error as before my changes ("No rule to make targetOUT
, needed bytest-cli
. Stop.")I performed the following tests manually, confirming that the results were as expected:
Test 1: Using original features
The following command uses the original features of the utility:
... returned a 66k PAGE-XML file, with the following excerpt on line 8 referencing the input image file:
Test 2: Using the new features via CLI
The following command uses the new features added in this PR:
... returned an identical file, except for the following excerpt on line 8, corresponding to the one above:
I'd greatly appreciate any feedback!