COW codes Germany/Vietnam #179

sumtxt · 2018-05-31T15:28:10Z

countrycode version 1.0.0 doesn't match the COW codes for Germany (260) and Vietnam (816), i.e.
countrycode(c(260,816), "cown", "country.name") outputs (NA,NA)

vincentarelbundock · 2018-05-31T15:36:35Z

These two countries are tricky, because CoW assigns different numerical identifiers depending on the year.

In its cross-sectional incarnation, countrycode must use a one-to-one mapping, which forces us to choose one and only one cown code per country. Currently, we have:

library(countrycode)
countrycode('Germany', 'country.name', 'cown')
[1] 255
countrycode('Vietnam', 'country.name', 'cown')
[1] 817

Are these suboptimal, you think?

If you are looking at panel data, a better option would be to use the country year dataset that is packaged with countrycode: countrycode::codelist_panel

vincentarelbundock · 2018-05-31T15:41:11Z

Another alternative is to use the custom_match argument:


> library(countrycode)
> countrycode(c(816, 817, 255, 260), 'cown', 'country.name', custom_match = c('816' = 'Vietnam', '260' = 'Germany'))
[1] "Vietnam" "Vietnam" "Germany" "Germany"

sumtxt · 2018-05-31T15:48:55Z

I see your point going from names to COW codes, but why:

countrycode(c(260,816), "cown", "country.name")
[1] NA NA

This should be

[1] German Federal Republic, Vietnam

according to COW system membership file.

vincentarelbundock · 2018-05-31T15:52:35Z

I don't understand what you mean. Can you show me in both directions what you want? And please add iso3c for reference.

For instance:

Germany -> 255
Germany -> DEU
German Federal Republic -> DEU
German Federal Republic -> 260
DEU -> Germany
260 -> German Federal Republic
255 -> Germany

Obviously, this breaks the one-to-one mapping condition...

sumtxt · 2018-05-31T16:05:28Z

According to COW http://www.correlatesofwar.org/data-sets/state-system-membership 260 = "German Federal Republic" and 255 = "Germany". I would expect that countrycode(260, "cown", "country.name") always outputs "German Federal Republic" since that name is assigned to the code 260 independent of any year. I understand that the output for countrycode("German Federal Republic", "country.name", "cown") depends on the year.

vincentarelbundock · 2018-05-31T16:13:37Z

The way countrycode works, we need the dictionary to work in BOTH DIRECTIONS. There cannot be duplicate entries, or asymmetric mappings. This is why we need a strict one-to-one map. There cannot be any one-to-two, or two-to-one.

Since the cown of "German Federal Republic" depends on the year, we need to choose either 260 or 255 in order to preserve that one-to-one symmetric mapping. To some extent, the choice is arbitrary. Currently, countrycode chose 255, which is why 260 doesn't produce anything.

The proper way to deal with this issue is to use a panel conversion dictionary instead of a cross-sectional one. This is why countrycode ships with the codelist_panel data.frame.
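
The one-to-one constraint can be illustrated with a toy dictionary in base R. This is only a sketch: the data frame and the helper `is_one_to_one()` are hypothetical and are not countrycode internals; the codes are the ones quoted in this thread.

```r
# Toy dictionary: one row per country, one value per code scheme.
# (Illustrative only; not the package's actual codelist.)
dict <- data.frame(
  country.name = c("Germany", "Vietnam"),
  cown         = c(255, 817),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when neither column repeats a non-missing value,
# so lookups are unambiguous in both directions.
is_one_to_one <- function(d, a, b) {
  ok <- !is.na(d[[a]]) & !is.na(d[[b]])
  !anyDuplicated(d[[a]][ok]) && !anyDuplicated(d[[b]][ok])
}

is_one_to_one(dict, "country.name", "cown")

# Adding a second cown for the same name breaks the constraint:
# "Germany" would now point to both 255 and 260.
dict2 <- rbind(
  dict,
  data.frame(country.name = "Germany", cown = 260, stringsAsFactors = FALSE)
)
is_one_to_one(dict2, "country.name", "cown")
```

The first call returns TRUE and the second FALSE, which is the situation the cross-sectional dictionary has to avoid.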

cjyetman · 2018-05-31T16:57:15Z

I think the problem mainly arises from the ambiguous name "Federal Republic of Germany" and its ambiguous variations. If we agreed to always refer to the former "Federal Republic of Germany" as "West Germany" only, in regex and country.name.en etc., then it might be workable. We do convert "East Germany" to 265.

vincentarelbundock · 2018-05-31T17:01:43Z

Right. But that's a problem with CoW's coding, which we can't do anything about.


stateabb | ccode | statenme | styear | stmonth | stday | endyear | endmonth | endday | version
GMY | 255 | Germany | 1816 | 1 | 1 | 1945 | 5 | 8 | 2016
GMY | 255 | Germany | 1990 | 10 | 3 | 2016 | 12 | 31 | 2016
GFR | 260 | German Federal Republic | 1955 | 5 | 5 | 1990 | 10 | 2 | 2016
GDR | 265 | German Democratic Republic | 1954 | 3 | 25 | 1990 | 10 | 2 | 2016
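
To make the year dependence concrete, here is a minimal base R sketch built only from the four rows above. `members_in()` is a hypothetical helper written for this illustration, and it works at year resolution only (the month and day columns are ignored).

```r
# The four rows quoted above from the CoW state system membership file.
cow <- data.frame(
  ccode    = c(255, 255, 260, 265),
  statenme = c("Germany", "Germany", "German Federal Republic",
               "German Democratic Republic"),
  styear   = c(1816, 1990, 1955, 1954),
  endyear  = c(1945, 2016, 1990, 1990),
  stringsAsFactors = FALSE
)

# Hypothetical helper: codes that are system members at some point
# in the given year.
members_in <- function(year) {
  cow[cow$styear <= year & year <= cow$endyear, c("ccode", "statenme")]
}

members_in(1985)  # 260 and 265
members_in(2000)  # 255 only
members_in(1990)  # 255, 260 and 265: the changeover happens within the year
```

The same territory is therefore reachable under different codes depending on the year, which a single cross-sectional row cannot express.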

cjyetman · 2018-05-31T17:50:37Z

as far as I can tell, these are always true in CoW...
cowc: GFR == cown: 260 == cow.name: "German Federal Republic" == "West Germany"
cowc: GMY == cown: 255 == cow.name: "Germany" == "Germany" (as in current Germany, and pre-1946 Germany)

so these could always work, not conflict, and always be directly reversible...
cown: 260 -> "West Germany"
"West Germany" -> cown: 260

that only becomes a problem if we want to allow something like (as some coding schemes do)...
"West Germany" -> iso3c: DEU
"Germany" -> iso3c: DEU
because then when you try to reverse, you'll have two matches

if we had...

country.name.en | country.name.en.regex | cown | cowc | cow.name | iso3c | every other column
West Germany | "West Germany" | 260 | GFR | German Federal Republic | NA | NA
Germany | [current regex for Germany (assuming it doesn't match "West Germany")] | 255 | GMY | Germany | DEU | [as it currently is]

it seems like it should work if we don't allow "West Germany" to be converted into anything else that's not definitively, unambiguously, exclusively equivalent to "West Germany"
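
A rough base R sketch of how the two proposed rows would behave. The `proposal` data frame and the exact-match `lookup()` helper are hypothetical; the real package matches country names by regex rather than by exact string.

```r
# Toy version of the two proposed rows (only three columns shown).
proposal <- data.frame(
  country.name.en = c("West Germany", "Germany"),
  cown            = c(260, 255),
  iso3c           = c(NA, "DEU"),
  stringsAsFactors = FALSE
)

# Hypothetical exact-match lookup. incomparables = NA keeps a missing
# input from matching the NA cells in the table.
lookup <- function(x, origin, destination, d = proposal) {
  d[[destination]][match(x, d[[origin]], incomparables = NA)]
}

lookup(c(260, 255), "cown", "country.name.en")       # both codes resolve
lookup("DEU", "iso3c", "country.name.en")            # only one candidate row
lookup("West Germany", "country.name.en", "iso3c")   # NA by design
```

Under this layout every lookup stays reversible, at the price that "West Germany" cannot be converted to any scheme other than the CoW ones.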

vincentarelbundock · 2018-05-31T18:00:34Z

That's exactly the problem. And it's not just iso3c, it's "all other destinations". My sense is that CoW codes are popular, but that they are definitely a minority use case relative to "West Germany" -> All other codes. I think we want to keep the latter, even if it costs us the former.

vincentarelbundock · 2018-05-31T18:01:29Z

But of course, I don't have any systematic data to back that up. Just my own practice.

cjyetman · 2018-05-31T18:06:06Z

got it... so to clarify a bit, doing this would screw up some of the other (possibly more common) codes which prefer to essentially view "West Germany" and "current Germany" as the same thing

vincentarelbundock · 2018-05-31T18:08:08Z

Yes.

Do you share the intuition that those use-cases are more common?

cjyetman · 2018-05-31T18:11:28Z

in my research yes, but I think it's a shame because CoW is one of very few robust country code schemes one can use for time series that go back ~20+ years

cjyetman · 2018-05-31T18:14:25Z

btw... there is a row in the current codelist for "Federal Republic of Germany" separate from the "Germany" row, but it is always NA except for country.name.en, country.name.en.regex, country.name.de, country.name.de.regex. Based on this discussion, that should probably not be there.

vincentarelbundock · 2018-05-31T18:20:06Z

Good catch. I removed that line in this commit: 51f4a6f

That whole discussion reinforces my belief that mostly, what we should be using is codelist_panel. The build process for that is pretty hackish, and I'm sure we could improve the substantive choices made, but that seems like a more robust avenue.

cjyetman · 2018-05-31T19:11:59Z

How is codelist_panel intended to be used?

This is somewhat convoluted, and the result isn't great either?

library(tibble)
library(dplyr)
library(countrycode)

df <- tribble(
  ~country, ~year, ~var,
  260,      1985,  1,
  265,      1985,  2,
  255,      2000,  3
)

df %>% 
  left_join(codelist_panel[, c("cown", "country.name.en", "year")], 
            by = c("country" = "cown", "year" = "year"))

# # A tibble: 3 x 4
#   country  year   var country.name.en           
#     <dbl> <dbl> <dbl> <chr>                     
# 1     260  1985     1 Germany                   
# 2     265  1985     2 German Democratic Republic
# 3     255  2000     3 Germany

I think it would be good to think about...

a good example of how to use it
a better way to use it
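
One possible direction, sketched in base R under stated assumptions: wrap the join in a small helper that takes the code and the year together. `convert_panel()` is a hypothetical function, not part of countrycode, and the toy `panel` below stands in for countrycode::codelist_panel, copying only the three rows from the join output above.

```r
# Toy country-year panel; the rows mirror the join output above.
panel <- data.frame(
  cown            = c(260, 265, 255),
  year            = c(1985, 1985, 2000),
  country.name.en = c("Germany", "German Democratic Republic", "Germany"),
  stringsAsFactors = FALSE
)

# Hypothetical wrapper: convert a code given its year by matching on a
# combined (code, year) key. Pairs with no row in the panel return NA.
convert_panel <- function(code, year, origin, destination, d = panel) {
  key_in  <- paste(code, year)
  key_tab <- paste(d[[origin]], d$year)
  d[[destination]][match(key_in, key_tab)]
}

convert_panel(c(260, 265, 255), c(1985, 1985, 2000), "cown", "country.name.en")
convert_panel(255, 1985, "cown", "country.name.en")  # NA: no such row in the toy panel
```

This keeps the call shape close to countrycode() while making the year an explicit input; it does not change which names the panel assigns to each code.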

sumtxt · 2018-05-31T19:18:40Z

The one-to-one matching logic seems to be a major departure from the previous package version as I recall it. May I ask, what's the reason for this departure?

vincentarelbundock · 2018-05-31T19:19:12Z

No departure. It was always like that.

sumtxt · 2018-05-31T19:25:46Z

But in previous package versions I never got "NA" for a call countrycode(260, "cown", "country.name"). Why now? Up to version 0.19 from February this year, I get "Federal Republic of Germany", but now with version 1.0.0 it's NA.

vincentarelbundock · 2018-05-31T19:26:57Z

@cjyetman

I'm not sure why you find the left_join cumbersome. My current workflow looks like this:

> # Load
> library(tidyverse)
> library(countrycode)
>
> # Simulate data
> x <- tribble(
+ ~cown, ~year, ~var1,
+ 255,      1985,  1,
+ 255,      2000,  3,
+ 2,      1985,  2
+ )
> y <- tribble(
+ ~iso3c, ~year, ~var2,
+ 'DEU',      1985,  1,
+ 'DEU',      2000,  5,
+ 'USA',      1985,  2,
+ 'DZA',      2000,  3
+ )
>
> # Left merge into panel data
> panel <- countrycode::codelist_panel %>%
+          select(iso3c, cown, year)
> panel <- purrr::reduce(list(panel, x, y), left_join)
Joining, by = c("cown", "year")
Joining, by = c("iso3c", "year")
>
> # Check the result
> panel %>% filter(!is.na(var1) | !is.na(var2))
  iso3c cown year var1 var2
1   DZA  615 2000   NA    3
2   DEU  260 1985   NA    1
3   DEU  255 2000    3    5
4   USA    2 1985    2    2

cjyetman · 2018-05-31T19:29:36Z

It just seems a bit awkward, especially for a package that generally makes things incredibly easy. I don't have a solution in mind, but I might start thinking of one.

vincentarelbundock · 2018-05-31T19:32:53Z

Sounds good. Let me know if you have some ideas. I'm very interested.

W.r.t. the current issue, do you think we should include entries in dictionary_static with empty regex but values for cown and country.name.en? Those entries would be functionality-constrained, but it would solve this user's specific problem.

The important thing would be to ensure none of the entries duplicate what's in other rows (but our test suite should catch that).

cjyetman · 2018-05-31T19:35:11Z

@sumtxt I suppose it could be considered a bug...

devtools::install_github("vincentarelbundock/countrycode", ref = "0.19")
library(countrycode)
countrycode(260, "cown", "country.name")
# [1] "Federal Republic of Germany"
countrycode("Federal Republic of Germany", "country.name", "cown")
# [1] NA
# Warning messages:
#   1: In countrycode("Federal Republic of Germany", "country.name", "cown") :
#   Some values were not matched unambiguously: Federal Republic of Germany
# 
# 2: In countrycode("Federal Republic of Germany", "country.name", "cown") :
#   Some strings were matched more than once, and therefore set to <NA> in the result: Federal Republic of Germany,260,255

vincentarelbundock · 2018-05-31T19:35:13Z

arrgh, that can't really work, since now we're merging the dictionary based on unique regexes

Edit: And we want the regex for "Federal Republic of Germany" to map onto the other code.

cjyetman · 2018-05-31T19:36:18Z

fyi... signing off for a bit... I'll come back to this in the coming days

vincentarelbundock · 2022-08-24T20:10:32Z

Thanks again for opening this issue. I still recognize that this is a problem, but unfortunately this problem can only be satisfactorily addressed by a fundamental re-write of countrycode, to allow "many-to-one" mappings.

We already have a (dormant but open) issue for this, which you can follow if there is interest: #186

I have also noted the problem in a new "Frequently Requested Feature" issue, so I will close this one here. I'm only closing to avoid duplication and to clean up the repo. If Many-to-One is implemented, the specific case of Germany CoW codes will automatically be solved.

Thanks for your patience!

vincentarelbundock mentioned this issue Jun 3, 2018

Uniquely identify Russia vs. Soviet Union #180

Closed

cjyetman added match issue many-to-one issue labels Oct 27, 2018

cjyetman mentioned this issue Dec 23, 2020

"Origin code not supported" error #256

Closed

vincentarelbundock closed this as completed Aug 24, 2022
