-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Many-to-One mappings #186
Comments
Not sure when I'll have time to review this in depth, but... I was just pondering something like this recently... having a separate lookup table for each possible origin-destination pair. Seems a bit complicated, but it should be manageable to create in the dictionary creation code, with precise, traceable code for any specific decisions about matches that need to be made. I'm starting to think this is a better idea than what I was proposing here. My main fear would be how much this increases the size of the package due to heavy duplication of data, especially now that we have all of these cldr language variations. This probably will cause problems, or a least force changes with the custom dictionary feature. This could be problematic for other packages that are using |
No need to review this in-depth. The code is nowhere near ready, so I'd rather you preserve whatever reviewing energy for later. At this stage, I'm more interested in high-level design input. FWIW, the compressed binary with every single uni-directional map weighs 912K. I'm not sure if that's a big problem or not (I wouldn't mind, but I live somewhere with reasonably fast internet). An alternative would be to do merge on the fly, which would cut down on package size but impose a small compute penalty everytime I think it would be trivial to host a CSV file on github and a codelist in the main package for convenience. |
A clean solution might be to hold each code in a separate data frame with three columns: Here, |
Merging issues. See discussion here: #180 |
One nagging problem with
countrycode
(e.g., #182 #180 ) is that the current approach tocodelist
strictly requires bidirectional one-to-one mappings.This is problematic in cases where we want:
Russia -> RUS (iso)
USSR -> RUS (iso)
RUS -> Russia
I have been trying to find a solution forever without much result. Today, I pushed a (nearly working) branch with a potential path forward: https://github.com/vincentarelbundock/countrycode/tree/manytoone
The concept:
countrycode
. This means, for example, that we need a different regexes for Russia and USSR because Correlates of War treat them separately.dictionary/merge.R
codelist
internally, we usecodelist_map
, which is a list of lists of data.frames. For example, if we want to convert from cowc to iso3c, we usecodelist_map$cowc$iso3c
, which is a data.frame with only two columns.One key, for me is number 4 above, and right now too much still happens in the
get_*
functions. Theget
functions should just be scrapers, and users should have access to a well-document script to see how we reconciled origin vs. destination.Curious what @cjyetman thinks of this.
The text was updated successfully, but these errors were encountered: