Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ICU-22773 Migrate the CLDR conversion tool to Maven #3290

Merged
merged 1 commit into from
Dec 9, 2024

Conversation

mihnita
Copy link
Contributor

@mihnita mihnita commented Dec 5, 2024

This also touches / fixes (at least partially?) other issues:

  • ICU-20998 Fix conflicting uses of CLDR_DIR in cldr-to-icu process/tools
  • ICU-21426 Updates/changes to CLDR-to-ICU build tools to handle CLDR Maven migration
  • ICU-21748 Change ICU CLDR tools to use CLDR artifact directly
  • Fixed the unit tests. They didn't work before.
  • more

I will search and close / comment on these bugs and others once this lands.

Summary

  • Java code changes (the tool proper)
    • moved the XML configuration part in a separate file, out of the ant build xml
    • converted the 3 Ant Task classes to regular classes, and invoked them from a Main class.
    • the 3 files load the config from the separate xml file
    • created a "Main class", with a command line parser that recognizes all the options passed before from the ant build xml file
  • The CLDR Library (cldr-code) is now built and installed in the local Maven repo, not in a folder, and the extra lib artifact is gone (ICU-21748)
  • Updated all the documentation I was able to find

ToDo / wish list

  • renames / reorg? (see below)
  • I have a change that makes the generation 4-5 times faster
  • put together a small script (Python?) that invokes the 3-4 commands that now are in the README.md and need to be executed by hand, with copy/paste
  • There is still one ant build file one level up that imports test files from CLDR to ICU
    But it not tools/cldr/cldr-to-icu. Maybe move to Python?
  • But the last 2 points should maybe be part of a bigger task (automating things with python and moving away from copy-paste from .md files)

I tried to keep the changes at minimum.

So I preserved the old files with a "fake" Task class, with init() and execute() methods, as if it is an Ant Task.
Also, kept the files in the .../ant package.

This should (hopefully) help with the review.

I can do a cleaning after this with cleanup, refactoring (renaming and so on).

I did quite a bit of testing, but mostly manual, running the conversion tool on Linux / MacOS / Windows (which didn't work before, but works now), with all kind of options.

Checklist

  • Required: Issue filed: ICU-22773
  • Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

@mihnita mihnita requested review from echeran and markusicu December 5, 2024 17:06
@mihnita mihnita force-pushed the mihai_mavenize_cldr2icu branch from ade45b5 to a4a3f02 Compare December 5, 2024 17:18
@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/Cldr2Icu.java is different
  • tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/Cldr2IcuCliOptions.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@markusicu markusicu self-assigned this Dec 6, 2024
markusicu
markusicu previously approved these changes Dec 6, 2024
Copy link
Member

@markusicu markusicu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!
This looks very good and plausible, but I didn't try to understand all of the Task replacements and configs.
Some minor feedback, nothing intelligent.

@@ -40,65 +64,132 @@ relies on a pre-processing step, and the CLDR data must come from the separate
"staging" repository (i.e. https://github.com/unicode-org/cldr-staging) or be
pre-processed locally into a different directory.

:point_right: **Note**: the 3 folders can also be overridden:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional: do we need three ways to do this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately some is legacy.
We have a lot of documentation using 3 folders.
Pointing to the 3 paths where the 3 repos are cloned (icu, cldr, and cldr-staging).

There is also code relaying on that.
The cldr-code expects CLDR_DIR to be set. And not set in environment, but as Java property.
The ant did not allow one to set them from command line, but from the ant script itself, and (in some cases) from environment.
And then it used that set value(s) to set both environment AND the Java property.
Somewhere in the code you will see that I explicityl do a System.setProperty to keep thinks the way ant set them up.

So unfortunately it was tangled pretty badly.
I only kept things as they were.

There is quite a bit of refactoring that I would like to do.
And this can be one of them.
But has to also be coordinated with cldr-code, and a lot of doc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My question is not about the three folders, but about the three ways of defining them:

  • as a shell environment variable
  • as a JVM property with -Dname=value
  • as a command-line option

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a blocker, but maybe worth thinking about?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing code and doc assumed both shell environment variable and JVM property

The docs say "set this environment variable" everywhere.
And the defined values are used in a lot of copy / paste commands.

Then the code (in several places, including cldr-code, which is not even icu) assumed that some are set as Java properties.
And failed if not.
Ant was "magically" taking some of the environment variables and making them Java properties

For example:

<condition property="cldrDataDir" value="${env.CLDR_DATA_DIR}">
    <isset property="env.CLDR_DATA_DIR"/>
</condition>

And some were "fixed" properties in the ant script:

<property name="outDir" value="${basedir}/../../../icu4c/source/data/"/>

Then the Ant convert invoked Ant again as "convert-impl" by defining CLDR_DIR in environment, and passing other things as Java properties :-(

<exec executable="ant" searchpath="true" failonerror="true">
    <env key="CLDR_DIR" value="${cldrDataDir}" />
    ...
    <arg value="-DoutDir=${outDir}"/>
    ...

And finally the real "convert" task (in the "convert-impl" target) took some of them and invoked setters in the associated Java class (extending Task):

<convert cldrDir="${cldrDir}" outputDir="${outDir}"  ... />

But the point is: some of these were both Java properties and environment variables at the same time.
I only preserved the existing behavior.
AND it is only for the folders.

So the Java code never looked at environment vars, only Ant did.
Java looked at (a few) properties + setters.


I had a chat with Elango (and a whiteboard) on what to do about it.

The main question is really between JVM properties vs command line options.
I was split between them, each have pros and cons.

In the end CLI won. The rest needs to be unified.
But not a hight priority, and not in this CL.
It will need to (among others) change the cldr-code library in the cldr repo.

Note that only the 3 folders have this "override behavior", not the rest.
Because they had it already.
Grep the old code (before my CL):

$ grep -r   System.getProperty src/

src//test/java/org/unicode/icu/tool/cldrtoicu/SupplementalDataTest.java:        Path cldrRoot = Paths.get(System.getProperty("CLDR_DIR"));
src//main/java/org/unicode/icu/tool/cldrtoicu/IcuConverterConfig.java:        Optional.ofNullable(System.getProperty("ICU_DIR", null))
src//main/java/org/unicode/icu/tool/cldrtoicu/DebugWriter.java:        String cldrPath = System.getProperty("CLDR_DIR", System.getenv("CLDR_DIR"));

And note this in DebugWriter.java:

String cldrPath = System.getProperty("CLDR_DIR", System.getenv("CLDR_DIR"));

This assumption was there.
And it was too much surgery to fix in one step.
That's why I was forced to do a System.setProperty("CLDR_DIR", cldrDir); (in Cldr2IcuCliOptions.java)


TLDR: I am not arguing that it is good, or that I like it.
What I am trying to say is that was unclean before, it is the same now.

Deserves a cleanup, but it is not mavenization, and is not as simple as removing this override that I implemented here.

For example can't remove it from cldr-code, because CLDR itself also depends on it.
So the processes / scripts / docs in CLDR would need to be updated BEFORE removing it from this code.

@mihnita
Copy link
Contributor Author

mihnita commented Dec 6, 2024

Thanks! This looks very good and plausible, but I didn't try to understand all of the Task replacements and configs. Some minor feedback, nothing intelligent.

Thank you.


Yes, it also took me quite some time to figure it out, and understand what Ant was doing.
So I am basically doing a minimal Ant replacement :-)

On the way I tried to replace the config with json.
Then I tried to use xml and parse it with Jackson, but Jackson didn't call the setters that Ant was calling, and could not do xml attributes.
To make it work would have required a lot more refactoring.

What I ended up doing (short version)
I create a main class (Cldr2Icu) that simulates (to a degree) the work that Ant was doing.

Basically load the "task classes", load the xml fragment corresponding to each task, call init() to do extra initialization and validation, then call execute() to run the task proper. Which is kind of what Ant does with each task.

The config file is 100% sub-tree(s) that existed in the Ant build file.
The project/target[@name="convert-impl"]/convert in the old xml is now config/convert
And all the attributes on that old xml node become CLI options.
(while diffing them now I found some extra comments that I removed, so now they are really the same)

@mihnita mihnita requested a review from markusicu December 6, 2024 20:52
markusicu
markusicu previously approved these changes Dec 6, 2024
@markusicu
Copy link
Member

sgtm -- unless Elango objects, I can offer a rubber stamp...

@mihnita
Copy link
Contributor Author

mihnita commented Dec 7, 2024

I also managed to fix the tests (6f4fea1)

They were cause by real changes in the data (for example the likely subtags refactoring).
But these didn't fail, because they are not under icu4j/main, so nobody runs them.

@mihnita mihnita requested a review from markusicu December 7, 2024 21:00
echeran
echeran previously approved these changes Dec 9, 2024
Copy link
Contributor

@echeran echeran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mihnita mihnita force-pushed the mihai_mavenize_cldr2icu branch from 6f4fea1 to e23801f Compare December 9, 2024 17:24
@jira-pull-request-webhook
Copy link

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

markusicu
markusicu previously approved these changes Dec 9, 2024
@mihnita mihnita dismissed stale reviews from markusicu and echeran via edf6e86 December 9, 2024 17:58
@@ -472,12 +472,15 @@ mvn install -pl main/common_tests -Dtest=MeasureUnitTest#TestGreek
13b. Optionally run the tests in exhaustive mode

Optionally run before committing changes, or run to diagnose failures from
running exhaustive CI tests in the PR using `/azp run CI-Exhaustive`:
running exhaustive CI tests in the PR using exhaustive localy:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

running exhaustive CI tests locally:

Exhaustive tests in CL can ve triggered by running the "Exhaustive Tests for ICU"
action from the GitHub web UI.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo fix & adding link:

Exhaustive tests in CL can be triggered by running the "Exhaustive Tests for ICU"
action from the GitHub web UI:
https://unicode-org.github.io/icu/userguide/dev/ci.html#exhaustive-tests

except nicer if we can link to the .md page with a relative link, and GitHub Pages turns that into the link to the .html page

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reminder / nice to have: add the link like this or nicer for GH Pages source

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a link, hope it works.
I used CI (not CL) intentionally. Because that is a CI test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know, I had missed the "CL" typo

@@ -471,14 +471,13 @@ mvn install -pl main/common_tests -Dtest=MeasureUnitTest#TestGreek

13b. Optionally run the tests in exhaustive mode

Optionally run before committing changes, or run to diagnose failures from
running exhaustive CI tests in the PR using exhaustive localy:
Optionally run exhaustive tests localy before committing changes:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: localy --> locally

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reminder

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DONE

@mihnita mihnita force-pushed the mihai_mavenize_cldr2icu branch from 8d24a58 to 3b04fe6 Compare December 9, 2024 20:55
@jira-pull-request-webhook
Copy link

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@mihnita
Copy link
Contributor Author

mihnita commented Dec 9, 2024

Thank you all!

@mihnita mihnita merged commit 2fa8a09 into unicode-org:main Dec 9, 2024
92 checks passed
@mihnita mihnita deleted the mihai_mavenize_cldr2icu branch December 9, 2024 21:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants