
Many-to-One mappings #186

Open
vincentarelbundock opened this issue Jul 23, 2018 · 4 comments

@vincentarelbundock
Owner

One nagging problem with countrycode (e.g., #182, #180) is that the current approach to codelist strictly requires bidirectional one-to-one mappings.

This is problematic in cases where we want:

Russia -> RUS (iso)
USSR -> RUS (iso)
RUS -> Russia

I have been trying to find a solution forever without much result. Today, I pushed a (nearly working) branch with a potential path forward: https://github.com/vincentarelbundock/countrycode/tree/manytoone

The concept:

  1. A unique regex identifies every single geographic unit covered by any of the schemes in countrycode. This means, for example, that we need different regexes for Russia and the USSR because Correlates of War treats them separately.
  2. Each destination code must be associated with one and only one regex: many-to-one.
  3. Origin codes can be associated with more than one regex: many-to-one.
  4. This requires that we keep separate lists of origin and destination codes. The differences between origin and destination codes are handled explicitly in a centralized location: dictionary/merge.R
  5. Instead of using codelist internally, we use codelist_map, which is a list of lists of data.frames. For example, if we want to convert from cowc to iso3c, we use codelist_map$cowc$iso3c, which is a data.frame with only two columns (see the sketch below).
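To make the structure concrete, here is a rough sketch of what codelist_map could look like and how a lookup would use it. The codes and values below are placeholders chosen to mirror the Russia/USSR example, not entries from the actual dictionary:

```r
## Minimal sketch (not the actual branch code) of the proposed codelist_map:
## a list of lists, with one two-column data.frame per origin-destination pair.
codelist_map <- list(
  cowc = list(
    iso3c = data.frame(
      cowc  = c("RUS", "USR"),  # two distinct origin units (placeholder codes)
      iso3c = c("RUS", "RUS"),  # both map to the same destination code: many-to-one
      stringsAsFactors = FALSE
    )
  ),
  iso3c = list(
    country.name.en = data.frame(
      iso3c           = "RUS",     # as an origin code, RUS maps to exactly one unit
      country.name.en = "Russia",
      stringsAsFactors = FALSE
    )
  )
)

## With that structure, a conversion is just a match() on one small table:
convert <- function(sourcevar, origin, destination) {
  map <- codelist_map[[origin]][[destination]]
  map[[destination]][match(sourcevar, map[[origin]])]
}

convert(c("RUS", "USR"), "cowc", "iso3c")   # "RUS" "RUS"
convert("RUS", "iso3c", "country.name.en")  # "Russia"
```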

One key point, for me, is number 4 above, and right now too much still happens in the get_* functions. The get functions should just be scrapers, and users should have access to a well-documented script to see how we reconciled origin vs. destination codes.

Curious what @cjyetman thinks of this.

@cjyetman
Collaborator

Not sure when I'll have time to review this in depth, but...

I was just pondering something like this recently... having a separate lookup table for each possible origin-destination pair. Seems a bit complicated, but it should be manageable to create in the dictionary creation code, with precise, traceable code for any specific decisions about matches that need to be made. I'm starting to think this is a better idea than what I was proposing here. My main fear would be how much this increases the size of the package due to heavy duplication of data, especially now that we have all of these cldr language variations.
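Just to sketch what I mean by generating the pair tables in the dictionary creation code; the column handling here is illustrative, and any many-to-one decisions would still need to be patched in explicitly:

```r
## Rough sketch: derive one lookup table per origin-destination pair from a
## master data.frame `dict` that has one column per code scheme.
build_pair_tables <- function(dict, schemes = names(dict)) {
  out <- list()
  for (o in schemes) {
    out[[o]] <- list()
    for (d in setdiff(schemes, o)) {
      tab <- unique(dict[, c(o, d)])
      tab <- tab[!is.na(tab[[o]]) & !is.na(tab[[d]]), ]
      out[[o]][[d]] <- tab
    }
  }
  out
}

## Quick way to gauge the duplication/size concern:
## pair_tables <- build_pair_tables(codelist)
## format(object.size(pair_tables), units = "MB")
```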

This will probably cause problems, or at least force changes, with the custom dictionary feature.

This could be problematic for other packages that are using codelist directly, though that's never how it was meant to be used anyway (afaik). In the same vein, we no longer include a large CSV lookup table in the repo, which I think some people were pulling for use in their own projects (hopefully with attribution).

@vincentarelbundock
Owner Author

No need to review this in depth. The code is nowhere near ready, so I'd rather you preserve your reviewing energy for later. At this stage, I'm more interested in high-level design input.

FWIW, the compressed binary with every single uni-directional map weighs 912K. I'm not sure if that's a big problem or not (I wouldn't mind, but I live somewhere with reasonably fast internet).

An alternative would be to do the merge on the fly, which would cut down on package size but impose a small compute penalty every time countrycode is invoked. Maybe there's a way to cache it.

I think it would be trivial to host a CSV file on GitHub and keep a codelist in the main package for convenience.

@vincentarelbundock
Owner Author

A clean solution might be to hold each code scheme in a separate data frame with three columns: country.name.en.regex, code, unique_target. Then, we merge those dictionaries on the fly and use the memoise package to speed up repeated invocations.

Here, memoise could be listed in Suggests rather than Depends.
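For illustration, the merge-plus-memoise idea could look roughly like this; the per-scheme tables, the unique_target flag handling, and the function name are assumptions, not working code from the branch:

```r
## Sketch: one three-column data.frame per scheme, stored in a list
## `code_tables`, with columns country.name.en.regex, code, unique_target
## (a flag marking the single row to use when the scheme is the destination).
build_map <- function(origin, destination) {
  o <- code_tables[[origin]]
  d <- code_tables[[destination]]
  d <- d[d$unique_target, c("country.name.en.regex", "code")]  # one row per destination code
  names(d)[names(d) == "code"] <- destination
  names(o)[names(o) == "code"] <- origin
  merge(o[, c("country.name.en.regex", origin)], d,
        by = "country.name.en.regex", all.x = TRUE)
}

## memoise can stay in Suggests and only be used when it is installed:
if (requireNamespace("memoise", quietly = TRUE)) {
  build_map <- memoise::memoise(build_map)
}
```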

@vincentarelbundock
Owner Author

Merging issues. See discussion here: #180
