Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow conversion without image #26

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

joewiz
Copy link

@joewiz joewiz commented Jan 3, 2025

As proposed in #25, this PR adds a convert_file_without_image function, which can convert AWS-Textract-JSON to PAGE-XML without needing to read the original input image. This function takes the image’s dimensions as inputs, instead of reading these from the image.

The PR also adds “image-width” and “image-height” options to the CLI command, which triggers the use of the new function if they are supplied.

I lack python skills and cobbled this together, so pardon the lack of tests covering the new functionality.

I did run make test-api, which passed. (But make test-cli failed with the same error as before my changes ("No rule to make target OUT, needed by test-cli. Stop.")

I performed the following tests manually, confirming that the results were as expected:

Test 1: Using original features

The following command uses the original features of the utility:

textract2page --output-file page.xml tesseract.json source.jpg

... returned a 66k PAGE-XML file, with the following excerpt on line 8 referencing the input image file:

<pc:Page imageFilename="source.jpg" imageWidth="4593" imageHeight="7163">

Test 2: Using the new features via CLI

The following command uses the new features added in this PR:

textract2page --output-file page2.xml tesseract.json foo.bar --image-width 4593 --image-height 7163

... returned an identical file, except for the following excerpt on line 8, corresponding to the one above:

<pc:Page imageFilename="foo.bar" imageWidth="4593" imageHeight="7163">

I'd greatly appreciate any feedback!

- add convert_file_without_image function, which can convert AWS-Textract-JSON to PAGE-XML without reading the original input image; this function takes the image’s dimensions as inputs, instead of reading these from the image
- add optional “image-width” and “image-height” options to the CLI command, which triggers the use of the new function
@click.argument("image-file", type=click.Path(dir_okay=False, exists=True))
def cli(output_file, aws_json_file, image_file):
@click.argument("image-file", type=str)
Copy link
Author

@joewiz joewiz Jan 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note that with this change to the image-file argument, a path that does not exist will not be caught here. I'd welcome ideas about the best way to preserve path checking while allowing the user to supply a filename when the image is not present. Perhaps a 2nd parameter would be the best way, perhaps in the form of a dedicated command that doesn't mess with the original one? Or is this ok, and we can throw an error in the convert_file function if the referenced file can't be found? (Again, please forgive my lack of Python skills and familiarity with Python idioms.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, interdependencies are not easily covered by Click. (It's doable via Cloup, but not necessary.)

I would rather keep the type, but merely set exists=False to skip verification of the path, and then cover all combinations in a simple assertion, see below.

Copy link
Member

@bertsky bertsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks @joewiz, much appreciated!

Your implementation is already near perfect IMO – see below for some cosmetic requests. (If you don't want to work them in, let me know and I'll do it myself.)

Comment on lines +731 to +732
Requires the absolute dimensions of the original input image used for AWS OCR.
(If the image is available, call convert_file instead.)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Requires the absolute dimensions of the original input image used for AWS OCR.
(If the image is available, call convert_file instead.)
Also requires the original input image used for AWS OCR, but only to reference it under
`Page/@imageFilename` with its full pathdoes not actually require an existing file under that path.
Instead, this additionally requires the absolute dimensions (pixel resolution) of the image.

@click.argument("image-file", type=click.Path(dir_okay=False, exists=True))
def cli(output_file, aws_json_file, image_file):
@click.argument("image-file", type=str)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, interdependencies are not easily covered by Click. (It's doable via Cloup, but not necessary.)

I would rather keep the type, but merely set exists=False to skip verification of the path, and then cover all combinations in a simple assertion, see below.

Comment on lines +17 to +19
@click.argument("image-file", type=str)
@click.option("--image-width", type=int, help="Width of the image in pixels, if the image isn't available.")
@click.option("--image-height", type=int, help="Height of the image in pixels, if the image isn't available.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@click.argument("image-file", type=str)
@click.option("--image-width", type=int, help="Width of the image in pixels, if the image isn't available.")
@click.option("--image-height", type=int, help="Height of the image in pixels, if the image isn't available.")
@click.argument("image-file", type=click.Path(dir_okay=False, exists=False))
@click.option("--image-width", type=int, help="width of the image in pixels (to avoid opening the image file)")
@click.option("--image-height", type=int, help="height of the image in pixels (to avoid opening the image file)")

The output file will reference the image file under `Page/@imageFilename`
with its full path. (So you may want to use a relative path.)
The output file will reference the image file using the name you provide.
(So you may want to use a relative path.)
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""
"""
assert image_width and image_height or image_file and exists(image_file),\
"requires passing either an existing image file path or --image-width and --image-height"

(and also at the top of the file: from os.path import exists)

convert_file_without_image(json_path, img_path, img_width, img_height, out_path)


def convert_file_without_image(json_path: str, img_path: str, img_width: int, img_height: int, out_path: str) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this function should also be exported (i.e. added to the list in __init__.py's __all__)

@bertsky
Copy link
Member

bertsky commented Jan 6, 2025

(But make test-cli failed with the same error as before my changes ("No rule to make target OUT, needed by test-cli. Stop.")

Strange! Sounds like your make does not know about target-specific variables, yet. Which version are you using?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants