COW codes Germany/Vietnam #179

Closed
sumtxt opened this issue May 31, 2018 · 27 comments
@sumtxt

sumtxt commented May 31, 2018

countrycode version 1.00.0 doesn't match the COW codes for Germany (260) and Vietnam (816), i.e.
countrycode(c(260,816), "cown", "country.name") outputs (NA,NA)

@vincentarelbundock
Owner

These two countries are tricky, because CoW assigns different numerical identifiers depending on the year.

In its cross-sectional incarnation, countrycode must use a one-to-one mapping, which forces us to choose one and only one cown code per country. Currently, we have:

library(countrycode)
countrycode('Germany', 'country.name', 'cown')
[1] 255
countrycode('Vietnam', 'country.name', 'cown')
[1] 817

Are these suboptimal, you think?

If you are looking at panel data, a better option would be to use the country year dataset that is packaged with countrycode: countrycode::codelist_panel

@vincentarelbundock
Owner

Another alternative is to use the custom_match argument:


> library(countrycode)
> # 816 is CoW's code for Vietnam and 260 is its code for the German Federal Republic
> countrycode(c(816, 817, 255, 260), 'cown', 'country.name', custom_match = c('816' = 'Vietnam', '260' = 'Germany'))
[1] "Vietnam" "Vietnam" "Germany" "Germany"

@sumtxt
Author

sumtxt commented May 31, 2018

I see your point going from names to COW codes, but why:

countrycode(c(260,816), "cown", "country.name")
[1] NA NA

This should be

[1] German Federal Republic, Vietnam

according to COW system membership file.

@vincentarelbundock
Owner

I don't understand what you mean. Can you show me in both directions what you want? And please add iso3c for reference.

For instance:

Germany -> 255
Germany -> DEU
German Federal Republic -> DEU
German Federal Republic -> 260
DEU -> Germany
260 -> German Federal Republic
255 -> Germany

Obviously, this breaks the one-to-one mapping condition...

@sumtxt
Author

sumtxt commented May 31, 2018

According to COW http://www.correlatesofwar.org/data-sets/state-system-membership 260 = "German Federal Republic" and 255 = "Germany". I would expect that countrycode(260, "cown", "country.name") always outputs "German Federal Republic" since that name is assigned to the code 260 independent of any year. I understand that the output for countrycode("German Federal Republic", "country.name", "cown") depends on the year.

@vincentarelbundock
Owner

The way countrycode works, we need the dictionary to work in BOTH DIRECTIONS. There cannot be duplicate entries, or asymmetric mappings. This is why we need a strict one-to-one map. There cannot be any one-to-two, or two-to-one.

Since the cown of "German Federal Republic" depends on the year, we need to choose either 260 or 255 in order to preserve that one-to-one symmetric mapping. To some extent, the choice is arbitrary. Currently, countrycode chose 255, which is why 260 doesn't produce anything.

The proper way to deal with this issue is to use a panel conversion dictionary instead of a cross-sectional one. This is why countrycode ships with the codelist_panel data.frame.
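
To make the one-to-one constraint concrete, here is a minimal base-R sketch with a toy two-row dictionary. The `dict` data frame and the `convert()` helper are made up for illustration and are not countrycode's internals. Because each country appears in exactly one row with exactly one code, a code that was not chosen for the dictionary has nothing to match against:

```r
# Toy dictionary: one row per country, one cown code per row (illustrative only)
dict <- data.frame(
  country.name = c("Germany", "Vietnam"),
  cown         = c(255, 817),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up `x` in column `origin`, return column `destination`
convert <- function(x, origin, destination, dictionary = dict) {
  dictionary[[destination]][match(x, dictionary[[origin]])]
}

convert(c("Germany", "Vietnam"), "country.name", "cown")  # the chosen codes
convert(c(255, 817), "cown", "country.name")              # reverses cleanly
convert(c(260, 816), "cown", "country.name")              # not in the dictionary, so NA
```

Adding a second row for 260 would fix the last call, but only by giving "Germany" two candidate codes in the other direction.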

@cjyetman
Collaborator

I think the problem mainly arises from the ambiguous name "Federal Republic of Germany" and its ambiguous variations. If we agreed to always refer to the former "Federal Republic of Germany" as "West Germany" only, in regex and country.name.en etc., then it might be workable. We do convert "East Germany" to 265.

@vincentarelbundock
Owner

vincentarelbundock commented May 31, 2018

Right. But that's a problem with CoW's coding, which we can't do anything about.


stateabb | ccode | statenme | styear | stmonth | stday | endyear | endmonth | endday | version
GMY | 255 | Germany | 1816 | 1 | 1 | 1945 | 5 | 8 | 2016
GMY | 255 | Germany | 1990 | 10 | 3 | 2016 | 12 | 31 | 2016
GFR | 260 | German Federal Republic | 1955 | 5 | 5 | 1990 | 10 | 2 | 2016
GDR | 265 | German Democratic Republic | 1954 | 3 | 25 | 1990 | 10 | 2 | 2016
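
The year dependence in these rows can be sketched in a few lines of base R. This is only an illustration: the `cow_states` data frame copies the four rows above, the `codes_in_year()` helper is hypothetical, and dates are reduced to whole years (months and days are ignored), which is an assumption made for brevity.

```r
# The four state-system rows quoted above, reduced to year resolution
# (assumption: months and days are ignored for simplicity)
cow_states <- data.frame(
  ccode    = c(255, 255, 260, 265),
  statenme = c("Germany", "Germany", "German Federal Republic",
               "German Democratic Republic"),
  styear   = c(1816, 1990, 1955, 1954),
  endyear  = c(1945, 2016, 1990, 1990),
  stringsAsFactors = FALSE
)

# Hypothetical helper: which of these codes are system members in a given year?
codes_in_year <- function(year, states = cow_states) {
  active <- states$styear <= year & year <= states$endyear
  states[active, c("ccode", "statenme")]
}

codes_in_year(1985)  # rows for 260 and 265
codes_in_year(2000)  # row for 255 only
```

A cross-sectional dictionary has no `year` argument, so it cannot make this distinction and has to settle on one code.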

@cjyetman
Collaborator

as far as I can tell, these are always true in CoW...
cowc: GFR == cown: 260 == cow.name: "German Federal Republic" == "West Germany"
cowc: GMY == cown: 255 == cow.name: "Germany" == "Germany" (as in current Germany, and pre-1946 Germany)

so these could always work, not conflict, and always be directly reversible...
cown: 260 -> "West Germany"
"West Germany" -> cown: 260

that only becomes a problem if we want to allow something like (as some coding schemes do)...
"West Germany" -> iso3c: DEU
"Germany" -> iso3c: DEU
because then when you try to reverse, you'll have two matches

if we had...

country.name.en | country.name.en.regex | cown | cowc | cow.name | iso3c | every other column
West Germany | "West Germany" | 260 | GFR | German Federal Republic | NA | NA
Germany | [current regex for Germany (assuming it doesn't match "West Germany")] | 255 | GMY | Germany | DEU | [as it currently is]

it seems like it should work if we don't allow "West Germany" to be converted into anything else that's not definitively, unambiguously, exclusively equivalent to "West Germany"
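
A rough way to check that claim is to test whether any non-missing value repeats within a code column. The sketch below uses a toy version of the two rows above; the `proposal` data frame and the `is_reversible()` helper are hypothetical names, not part of countrycode.

```r
# Toy version of the two proposed rows (illustrative only)
proposal <- data.frame(
  country.name.en = c("West Germany", "Germany"),
  cown            = c(260, 255),
  iso3c           = c(NA, "DEU"),
  stringsAsFactors = FALSE
)

# Hypothetical check: a column is reversible if no non-missing value repeats
is_reversible <- function(column) {
  values <- column[!is.na(column)]
  !any(duplicated(values))
}

sapply(proposal[c("cown", "iso3c")], is_reversible)  # both TRUE

# Letting "West Germany" also map to DEU creates two matches on the way back
proposal$iso3c[1] <- "DEU"
is_reversible(proposal$iso3c)                        # FALSE
proposal$country.name.en[proposal$iso3c == "DEU"]    # both names come back
```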

@vincentarelbundock
Owner

That's exactly the problem. And it's not just iso3c, it's "all other destinations". My sense is that CoW codes are popular, but that they are definitely a minority use case relative to "West Germany" -> All other codes. I think we want to keep the latter, even if it costs us the former.

@vincentarelbundock
Owner

But of course, I don't have any systematic data to back that up. Just my own practice.

@cjyetman
Collaborator

got it... so to clarify a bit, doing this would screw up some of the other (possibly more common) codes which prefer to essentially view "West Germany" and "current Germany" as the same thing

@vincentarelbundock
Owner

Yes.

Do you share the intuition that those use-cases are more common?

@cjyetman
Collaborator

in my research yes, but I think it's a shame because CoW is one of very few robust country code schemes one can use for time series that go back ~20+ years

@cjyetman
Collaborator

btw... there is a row in the current codelist for "Federal Republic of Germany" separate from the "Germany" row, but it is always NA except for country.name.en, country.name.en.regex, country.name.de, country.name.de.regex. Based on this discussion, that should probably not be there.

@vincentarelbundock
Owner

Good catch. I removed that line in this commit: 51f4a6f

That whole discussion reinforces my belief that mostly, what we should be using is codelist_panel. The build process for that is pretty hackish, and I'm sure we could improve the substantive choices made, but that seems like a more robust avenue.

@cjyetman
Collaborator

How is codelist_panel intended to be used?

This is somewhat convoluted, and the result isn't great either?

library(tibble)
library(dplyr)
library(countrycode)

df <- tribble(
  ~country, ~year, ~var,
  260,      1985,  1,
  265,      1985,  2,
  255,      2000,  3
)

df %>% 
  left_join(codelist_panel[, c("cown", "country.name.en", "year")], 
            by = c("country" = "cown", "year" = "year"))

# # A tibble: 3 x 4
#   country  year   var country.name.en           
#     <dbl> <dbl> <dbl> <chr>                     
# 1     260  1985     1 Germany                   
# 2     265  1985     2 German Democratic Republic
# 3     255  2000     3 Germany     

I think it would be good to think about...

  1. a good example of how to use it
  2. a better way to use it
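
One possible direction, sketched here only as a starting point: wrap the join in a small helper that matches on (code, year) pairs, so the call reads more like a regular conversion. The `convert_panel()` function and the `toy_panel` data frame are hypothetical; the toy rows simply mirror the join output above.

```r
# Toy stand-in for a panel dictionary with the columns used in this thread
# (the rows mirror the join output shown above; a real panel has many more)
toy_panel <- data.frame(
  cown            = c(260, 265, 255),
  year            = c(1985, 1985, 2000),
  country.name.en = c("Germany", "German Democratic Republic", "Germany"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: year-aware conversion by matching on (code, year) pairs
convert_panel <- function(code, year, origin, destination, panel) {
  key_in  <- paste(code, year)
  key_ref <- paste(panel[[origin]], panel$year)
  panel[[destination]][match(key_in, key_ref)]
}

df <- data.frame(country = c(260, 265, 255), year = c(1985, 1985, 2000), var = 1:3)
df$country.name.en <- convert_panel(df$country, df$year,
                                    "cown", "country.name.en", toy_panel)
df
```

Passing `countrycode::codelist_panel` as `panel` should behave the same way, given that it carries the `cown`, `year` and `country.name.en` columns used in the join above, though rows with missing codes would need care.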

@sumtxt
Author

sumtxt commented May 31, 2018

The one-to-one matching logic seems to be a major departure from the previous package version as I recall it. May I ask, what's the reason for this departure?

@vincentarelbundock
Owner

No departure. It was always like that.

@sumtxt
Author

sumtxt commented May 31, 2018

But in previous package versions I never got "NA" for a call to countrycode(260, "cown", "country.name"). Why now? Up to version 0.19 from February this year, I get "Federal Republic of Germany", but now with version 1.0.0 it's NA.

@vincentarelbundock
Owner

vincentarelbundock commented May 31, 2018

@cjyetman

I'm not sure why you find the left_join cumbersome. My current workflow looks like this:

> # Load
> library(tidyverse)
> library(countrycode)
>
> # Simulate data
> x <- tribble(
+ ~cown, ~year, ~var1,
+ 255,      1985,  1,
+ 255,      2000,  3,
+ 2,      1985,  2
+ )
> y <- tribble(
+ ~iso3c, ~year, ~var2,
+ 'DEU',      1985,  1,
+ 'DEU',      2000,  5,
+ 'USA',      1985,  2,
+ 'DZA',      2000,  3
+ )
>
> # Left merge into panel data
> panel <- countrycode::codelist_panel %>%
+          select(iso3c, cown, year)
> panel <- purrr::reduce(list(panel, x, y), left_join)
Joining, by = c("cown", "year")
Joining, by = c("iso3c", "year")
>
> # Check the result
> panel %>% filter(!is.na(var1) | !is.na(var2))
  iso3c cown year var1 var2
1   DZA  615 2000   NA    3
2   DEU  260 1985   NA    1
3   DEU  255 2000    3    5
4   USA    2 1985    2    2

@cjyetman
Collaborator

It just seems a bit awkward, especially for a package that generally makes things incredibly easy. I don't have a solution in mind, but I might start thinking of one.

@vincentarelbundock
Owner

Sounds good. Let me know if you have some ideas. I'm very interested.

w.r.t. the current issue, do you think we should include entries in dictionary_static with empty regex but values for cown and country.name.en? Those entries would be functionality-constrained, but it would solve this user's specific problem.

The important thing would be to ensure none of the entries duplicate what's in other rows (but our test suite should catch that).

@cjyetman
Collaborator

@sumtxt I suppose it could be considered a bug...

devtools::install_github("vincentarelbundock/countrycode", ref = "0.19")
library(countrycode)
countrycode(260, "cown", "country.name")
# [1] "Federal Republic of Germany"
countrycode("Federal Republic of Germany", "country.name", "cown")
# [1] NA
# Warning messages:
#   1: In countrycode("Federal Republic of Germany", "country.name", "cown") :
#   Some values were not matched unambiguously: Federal Republic of Germany
# 
# 2: In countrycode("Federal Republic of Germany", "country.name", "cown") :
#   Some strings were matched more than once, and therefore set to <NA> in the result: Federal Republic of Germany,260,255

@vincentarelbundock
Owner

vincentarelbundock commented May 31, 2018

arrgh, that can't really work, since now we're merging the dictionary based on unique regexes

Edit: And we want the regex for "Federal Republic of Germany" to map onto the other code.

@cjyetman
Collaborator

fyi... signing off for a bit... I'll come back to this in the coming days

@vincentarelbundock
Owner

Thanks again for opening this issue. I still recognize that this is a problem, but unfortunately this problem can only be satisfactorily addressed by a fundamental re-write of countrycode, to allow "many-to-one" mappings.

We already have a (dormant but open) issue for this, which you can follow if there is interest: #186

I have also noted the problem in a new "Frequently Requested Feature" issue, so I will close this one here. I'm only closing to avoid duplication and to clean up the repo. If Many-to-One is implemented, the specific case of Germany CoW codes will automatically be solved.

Thanks for your patience!
