
Many-to-One mappings #186

Open
vincentarelbundock opened this issue Jul 23, 2018 · 4 comments

@vincentarelbundock
Owner

One nagging problem with countrycode (e.g., #182, #180) is that the current approach to codelist strictly requires bidirectional one-to-one mappings.

This is problematic in cases where we want:

Russia -> RUS (iso)
USSR -> RUS (iso)
RUS -> Russia

I have been trying to find a solution forever without much result. Today, I pushed a (nearly working) branch with a potential path forward: https://github.com/vincentarelbundock/countrycode/tree/manytoone

The concept:

  1. A unique regex identifies every single geographic unit covered by any of the schemes in countrycode. This means, for example, that we need different regexes for Russia and the USSR because Correlates of War treats them separately.
  2. Each destination code must be associated with one and only one regex: many-to-one.
  3. Origin codes can be associated with more than one regex: many-to-one.
  4. This requires that we keep separate lists of origin and destination codes. The differences between origin and destination codes are handled explicitly in a centralized location: dictionary/merge.R
  5. Instead of using codelist internally, we use codelist_map, which is a list of lists of data.frames. For example, if we want to convert from cowc to iso3c, we use codelist_map$cowc$iso3c, which is a data.frame with only two columns (see the sketch below).
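To make the structure concrete, here is a rough sketch of what codelist_map could look like and how a lookup would use it. The codes and values below are placeholders chosen to mirror the Russia/USSR example, not entries from the actual dictionary:

```r
## Minimal sketch (not the actual branch code) of the proposed codelist_map:
## a list of lists, with one two-column data.frame per origin-destination pair.
codelist_map <- list(
  cowc = list(
    iso3c = data.frame(
      cowc  = c("RUS", "USR"),  # two distinct origin units (placeholder codes)
      iso3c = c("RUS", "RUS"),  # both map to the same destination code: many-to-one
      stringsAsFactors = FALSE
    )
  ),
  iso3c = list(
    country.name.en = data.frame(
      iso3c           = "RUS",     # as an origin code, RUS maps to exactly one unit
      country.name.en = "Russia",
      stringsAsFactors = FALSE
    )
  )
)

## With that structure, a conversion is just a match() on one small table:
convert <- function(sourcevar, origin, destination) {
  map <- codelist_map[[origin]][[destination]]
  map[[destination]][match(sourcevar, map[[origin]])]
}

convert(c("RUS", "USR"), "cowc", "iso3c")   # "RUS" "RUS"
convert("RUS", "iso3c", "country.name.en")  # "Russia"
```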

One key point, for me, is number 4 above, and right now too much still happens in the get_* functions. The get functions should just be scrapers, and users should have access to a well-documented script to see how we reconciled origin vs. destination codes.

Curious what @cjyetman thinks of this.

@cjyetman
Collaborator

Not sure when I'll have time to review this in depth, but...

I was just pondering something like this recently... having a separate lookup table for each possible origin-destination pair. Seems a bit complicated, but it should be manageable to create in the dictionary creation code, with precise, traceable code for any specific decisions about matches that need to be made. I'm starting to think this is a better idea than what I was proposing here. My main fear would be how much this increases the size of the package due to heavy duplication of data, especially now that we have all of these cldr language variations.
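Just to sketch what I mean by generating the pair tables in the dictionary creation code; the column handling here is illustrative, and any many-to-one decisions would still need to be patched in explicitly:

```r
## Rough sketch: derive one lookup table per origin-destination pair from a
## master data.frame `dict` that has one column per code scheme.
build_pair_tables <- function(dict, schemes = names(dict)) {
  out <- list()
  for (o in schemes) {
    out[[o]] <- list()
    for (d in setdiff(schemes, o)) {
      tab <- unique(dict[, c(o, d)])
      tab <- tab[!is.na(tab[[o]]) & !is.na(tab[[d]]), ]
      out[[o]][[d]] <- tab
    }
  }
  out
}

## Quick way to gauge the duplication/size concern:
## pair_tables <- build_pair_tables(codelist)
## format(object.size(pair_tables), units = "MB")
```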

This will probably cause problems, or at least force changes, with the custom dictionary feature.

This could be problematic for other packages that are using codelist directly, though that's never how it was meant to be used anyway (afaik). In the same vein, we no longer include a large CSV lookup table in the repo, which I think some people were pulling for use in their own projects (hopefully with attribution).

@vincentarelbundock
Owner Author

No need to review this in depth. The code is nowhere near ready, so I'd rather you preserve your reviewing energy for later. At this stage, I'm more interested in high-level design input.

FWIW, the compressed binary with every single uni-directional map weighs 912K. I'm not sure if that's a big problem or not (I wouldn't mind, but I live somewhere with reasonably fast internet).

An alternative would be to do the merge on the fly, which would cut down on package size but impose a small compute penalty every time countrycode is invoked. Maybe there's a way to cache it.

I think it would be trivial to host a CSV file on GitHub and keep a codelist in the main package for convenience.

@vincentarelbundock
Owner Author

A clean solution might be to hold each code scheme in a separate data frame with three columns: country.name.en.regex, code, unique_target. Then, we merge those dictionaries on the fly and use the memoise package to speed up repeated invocations.

Here, memoise could be listed in Suggests rather than Depends.
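For illustration, the merge-plus-memoise idea could look roughly like this; the per-scheme tables, the unique_target flag handling, and the function name are assumptions, not working code from the branch:

```r
## Sketch: one three-column data.frame per scheme, stored in a list
## `code_tables`, with columns country.name.en.regex, code, unique_target
## (a flag marking the single row to use when the scheme is the destination).
build_map <- function(origin, destination) {
  o <- code_tables[[origin]]
  d <- code_tables[[destination]]
  d <- d[d$unique_target, c("country.name.en.regex", "code")]  # one row per destination code
  names(d)[names(d) == "code"] <- destination
  names(o)[names(o) == "code"] <- origin
  merge(o[, c("country.name.en.regex", origin)], d,
        by = "country.name.en.regex", all.x = TRUE)
}

## memoise can stay in Suggests and only be used when it is installed:
if (requireNamespace("memoise", quietly = TRUE)) {
  build_map <- memoise::memoise(build_map)
}
```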

@vincentarelbundock
Owner Author

Merging issues. See discussion here: #180
