Allow conversion without image #26

joewiz · 2025-01-03T04:11:48Z

As proposed in #25, this PR adds a convert_file_without_image function, which can convert AWS-Textract-JSON to PAGE-XML without needing to read the original input image. This function takes the image’s dimensions as inputs, instead of reading these from the image.

The PR also adds “image-width” and “image-height” options to the CLI command, which triggers the use of the new function if they are supplied.

I lack python skills and cobbled this together, so pardon the lack of tests covering the new functionality.

I did run make test-api, which passed. (But make test-cli failed with the same error as before my changes ("No rule to make target OUT, needed by test-cli. Stop.")

I performed the following tests manually, confirming that the results were as expected:

Test 1: Using original features

The following command uses the original features of the utility:

textract2page --output-file page.xml tesseract.json source.jpg

... returned a 66k PAGE-XML file, with the following excerpt on line 8 referencing the input image file:

<pc:Page imageFilename="source.jpg" imageWidth="4593" imageHeight="7163">

Test 2: Using the new features via CLI

The following command uses the new features added in this PR:

textract2page --output-file page2.xml tesseract.json foo.bar --image-width 4593 --image-height 7163

... returned an identical file, except for the following excerpt on line 8, corresponding to the one above:

<pc:Page imageFilename="foo.bar" imageWidth="4593" imageHeight="7163">

I'd greatly appreciate any feedback!

- add convert_file_without_image function, which can convert AWS-Textract-JSON to PAGE-XML without reading the original input image; this function takes the image’s dimensions as inputs, instead of reading these from the image - add optional “image-width” and “image-height” options to the CLI command, which triggers the use of the new function

joewiz · 2025-01-03T04:19:05Z

textract2page/cli.py

-@click.argument("image-file", type=click.Path(dir_okay=False, exists=True))
-def cli(output_file, aws_json_file, image_file):
+@click.argument("image-file", type=str)


Please note that with this change to the image-file argument, a path that does not exist will not be caught here. I'd welcome ideas about the best way to preserve path checking while allowing the user to supply a filename when the image is not present. Perhaps a 2nd parameter would be the best way, perhaps in the form of a dedicated command that doesn't mess with the original one? Or is this ok, and we can throw an error in the convert_file function if the referenced file can't be found? (Again, please forgive my lack of Python skills and familiarity with Python idioms.)

Yes, interdependencies are not easily covered by Click. (It's doable via Cloup, but not necessary.)

I would rather keep the type, but merely set exists=False to skip verification of the path, and then cover all combinations in a simple assertion, see below.

bertsky

Many thanks @joewiz, much appreciated!

Your implementation is already near perfect IMO – see below for some cosmetic requests. (If you don't want to work them in, let me know and I'll do it myself.)

bertsky · 2025-01-06T06:41:47Z

textract2page/convert_aws.py

+    Requires the absolute dimensions of the original input image used for AWS OCR.
+    (If the image is available, call convert_file instead.)


Suggested change

Requires the absolute dimensions of the original input image used for AWS OCR.

(If the image is available, call convert_file instead.)

Also requires the original input image used for AWS OCR, but only to reference it under

`Page/@imageFilename` with its full path – does not actually require an existing file under that path.

Instead, this additionally requires the absolute dimensions (pixel resolution) of the image.

bertsky · 2025-01-06T06:50:06Z

textract2page/cli.py

-@click.argument("image-file", type=click.Path(dir_okay=False, exists=True))
-def cli(output_file, aws_json_file, image_file):
+@click.argument("image-file", type=str)


Yes, interdependencies are not easily covered by Click. (It's doable via Cloup, but not necessary.)

I would rather keep the type, but merely set exists=False to skip verification of the path, and then cover all combinations in a simple assertion, see below.

bertsky · 2025-01-06T06:53:58Z

textract2page/cli.py

+@click.argument("image-file", type=str)
+@click.option("--image-width", type=int, help="Width of the image in pixels, if the image isn't available.")
+@click.option("--image-height", type=int, help="Height of the image in pixels, if the image isn't available.")


Suggested change

@click.argument("image-file", type=str)

@click.option("--image-width", type=int, help="Width of the image in pixels, if the image isn't available.")

@click.option("--image-height", type=int, help="Height of the image in pixels, if the image isn't available.")

@click.argument("image-file", type=click.Path(dir_okay=False, exists=False))

@click.option("--image-width", type=int, help="width of the image in pixels (to avoid opening the image file)")

@click.option("--image-height", type=int, help="height of the image in pixels (to avoid opening the image file)")

bertsky · 2025-01-06T07:04:15Z

textract2page/cli.py

-    The output file will reference the image file under `Page/@imageFilename`
-    with its full path. (So you may want to use a relative path.)
+    The output file will reference the image file using the name you provide.
+    (So you may want to use a relative path.)
    """


Suggested change

"""

"""

assert image_width and image_height or image_file and exists(image_file),\

"requires passing either an existing image file path or --image-width and --image-height"

(and also at the top of the file: from os.path import exists)

bertsky · 2025-01-06T07:05:25Z

textract2page/convert_aws.py

+    convert_file_without_image(json_path, img_path, img_width, img_height, out_path)
+
+
+def convert_file_without_image(json_path: str, img_path: str, img_width: int, img_height: int, out_path: str) -> None:


Note: this function should also be exported (i.e. added to the list in __init__.py's __all__)

bertsky · 2025-01-06T07:17:48Z

(But make test-cli failed with the same error as before my changes ("No rule to make target OUT, needed by test-cli. Stop.")

Strange! Sounds like your make does not know about target-specific variables, yet. Which version are you using?

joewiz commented Jan 3, 2025

View reviewed changes

bertsky requested changes Jan 6, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow conversion without image #26

Allow conversion without image #26

joewiz commented Jan 3, 2025 •

edited

Loading

joewiz Jan 3, 2025 •

edited

Loading

bertsky Jan 6, 2025

bertsky left a comment

bertsky Jan 6, 2025

bertsky Jan 6, 2025

bertsky Jan 6, 2025

bertsky Jan 6, 2025

bertsky Jan 6, 2025

bertsky commented Jan 6, 2025

		Requires the absolute dimensions of the original input image used for AWS OCR.
		(If the image is available, call convert_file instead.)

-    Requires the absolute dimensions of the original input image used for AWS OCR.
-    (If the image is available, call convert_file instead.)
+    Also requires the original input image used for AWS OCR, but only to reference it under
+    `Page/@imageFilename` with its full path – does not actually require an existing file under that path.
+    Instead, this additionally requires the absolute dimensions (pixel resolution) of the image.

-    """
+    """
+    assert image_width and image_height or image_file and exists(image_file),\
+        "requires passing either an existing image file path or --image-width and --image-height"

		convert_file_without_image(json_path, img_path, img_width, img_height, out_path)


		def convert_file_without_image(json_path: str, img_path: str, img_width: int, img_height: int, out_path: str) -> None:

Allow conversion without image #26

Are you sure you want to change the base?

Allow conversion without image #26

Conversation

joewiz commented Jan 3, 2025 • edited Loading

joewiz Jan 3, 2025 • edited Loading

Choose a reason for hiding this comment

bertsky Jan 6, 2025

Choose a reason for hiding this comment

bertsky left a comment

Choose a reason for hiding this comment

bertsky Jan 6, 2025

Choose a reason for hiding this comment

bertsky Jan 6, 2025

Choose a reason for hiding this comment

bertsky Jan 6, 2025

Choose a reason for hiding this comment

bertsky Jan 6, 2025

Choose a reason for hiding this comment

bertsky Jan 6, 2025

Choose a reason for hiding this comment

bertsky commented Jan 6, 2025

joewiz commented Jan 3, 2025 •

edited

Loading

joewiz Jan 3, 2025 •

edited

Loading