Unicode characters in data column names throw an error in naWhere #15

Open
drag05 opened this issue Apr 29, 2022 · 3 comments

Comments


drag05 commented Apr 29, 2022

I have the following data:

> head(htc, 2)
      25 µL      50 µL     75 µL    100 µL  Accession
1: 1.265836 0.02575365 0.1428066 0.2107820 A0A024R6I7
2:       NA 0.01566025 0.1481060 0.2069585 A0A075B6K4

> dim(htc)
[1] 269   5

> htc[, colSums(is.na(.SD))]
    25 µL     50 µL     75 µL    100 µL Accession 
      200         0         3         0         0 

It is associated with the following naWhere, varp and varn objects:

> naWhere[1:4, ]
     25 µL 50 µL 75 µL 100 µL Accession
[1,] FALSE FALSE FALSE  FALSE     FALSE
[2,]  TRUE FALSE FALSE  FALSE     FALSE
[3,]  TRUE FALSE FALSE  FALSE     FALSE

> dim(naWhere)
[1] 269   5

> colSums(naWhere)
    25 µL     50 µL     75 µL    100 µL Accession 
      200         0         3         0         0 

> varp <- unique(unlist(vars))
> varp
[1] "50 μL"     "75 μL"     "100 μL"    "Accession" "25 μL"   ## maybe apply gtools::mixedsort ?

> varn
[1] "25 μL" "75 μL"

Calculating the left-out columns throws the following error:

leftOut <- !varp %in% varn & colSums(naWhere[, varp]) > 0

"Error in naWhere[, varp] : subscript out of bounds"

Checking varp against colnames(naWhere):

identical(varp, colnames(naWhere))
FALSE

> intersect(varp, colnames(naWhere))
[1] "Accession"

> varp %in% colnames(naWhere)
[1] FALSE FALSE FALSE  TRUE FALSE

> which(varp %in% colnames(naWhere)) ## only "Accession" matches
[1] 4
> which(colnames(naWhere) %in% varp) ## only "Accession" matches
[1] 5

The comparison still seems to work when checking varp against varn:

> !varp %in% varn
[1]  TRUE FALSE  TRUE  TRUE FALSE

The error seems to be caused by the presence of Unicode characters in the column names, although they pose no problem when comparing varp against varn, as shown by the last line of code above. However, either indexing by position (seq_along) or converting varp with base::enc2native seems to remove the error:

leftOut <- !varp %in% varn & colSums(naWhere[, seq(along=varp)]) > 0

> leftOut
    25 µL     50 µL     75 µL    100 µL Accession 
     TRUE     FALSE      TRUE     FALSE     FALSE 

> varp = enc2native(varp)
> leftOut <- !varp %in% varn & colSums(naWhere[, varp]) > 0
> leftOut
    50 µL     75 µL    100 µL Accession     25 µL 
    FALSE      TRUE     FALSE     FALSE      TRUE 

Please advise, thank you!
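
One possible reading of the output above: the column headers of htc and naWhere are printed with the micro sign (µ, U+00B5), while varp and varn are printed with the Greek small letter mu (μ, U+03BC). If that is really what the session holds, the names differ by code point and lookup by name cannot match. The sketch below is only an illustration under that assumption: the toy naWhere matrix and the helper normMu are hypothetical and not part of the package.

```r
# Column names as they appear in the pasted output: MICRO SIGN (U+00B5)
cols <- c("25 \u00B5L", "50 \u00B5L", "75 \u00B5L", "100 \u00B5L", "Accession")
# varp / varn as they are printed: GREEK SMALL LETTER MU (U+03BC)
varp <- c("50 \u03BCL", "75 \u03BCL", "100 \u03BCL", "Accession", "25 \u03BCL")
varn <- c("25 \u03BCL", "75 \u03BCL")

# Hypothetical toy NA indicator: NAs only in the "25" and "75" columns
naWhere <- matrix(FALSE, nrow = 4, ncol = 5, dimnames = list(NULL, cols))
naWhere[2:3, 1] <- TRUE
naWhere[4, 3]   <- TRUE

# The two characters look alike but are different code points
microSign <- utf8ToInt(enc2utf8("\u00B5"))  # 181
greekMu   <- utf8ToInt(enc2utf8("\u03BC"))  # 956

# Before normalising, only "Accession" is found among the column names
matchesBefore <- varp %in% colnames(naWhere)

# Hypothetical helper: spell every mu as GREEK SMALL LETTER MU
normMu <- function(x) gsub("\u00B5", "\u03BC", x, fixed = TRUE)

colnames(naWhere) <- normMu(colnames(naWhere))
varp <- normMu(varp)
varn <- normMu(varn)

# Name-based indexing now works and keeps the order of varp
leftOut <- !varp %in% varn & colSums(naWhere[, varp]) > 0
print(leftOut)
```

In this toy data every element of leftOut is FALSE, because the only columns with NAs are the ones listed in varn. The two workarounds above report TRUE for those same columns, which may be worth double-checking: positional indexing does not follow the order of varp, and converting only varp may leave it out of step with varn.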

Contributor

samFarrellDay commented Apr 30, 2022

Would you mind sending me the data? I'll probably implement the seq_along fix if everything else works as intended. I foresee several areas that will need to be fixed to handle Unicode characters.

Author

drag05 commented May 1, 2022

@samFarrellDay I do not own the data, but I could make an artificial set and post it here. So far I have found that Unicode also affects the diagnostic plots.

Unicode support would be really useful for documents and Shiny.
Otherwise, column names could be changed for the purpose of imputation and then changed back to Unicode for presentation, although working in Unicode throughout would save the overhead.
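
The rename-and-restore workaround mentioned above could look roughly like the sketch below. It uses a plain data.frame and made-up values so that it runs without extra packages; safeNames and the placeholder for the imputation step are assumptions, not the package's interface.

```r
# Hypothetical toy data with Unicode column names
df <- data.frame(x = c(1, NA), y = c(NA, 2))
names(df) <- c("25 \u03BCL", "50 \u03BCL")

# Keep the presentation names and switch to ASCII-only working names
origNames <- names(df)
safeNames <- paste0("V", seq_along(origNames))
names(df) <- safeNames

# ... the imputation would run here, seeing only the ASCII names ...
naCounts <- colSums(is.na(df))

# Restore the Unicode names on the data and on any named result
names(df) <- origNames
names(naCounts) <- origNames[match(names(naCounts), safeNames)]
print(naCounts)
```

Mapping results back through match() rather than by position keeps the restored names correct even if a step reorders or drops columns.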

Author

drag05 commented May 2, 2022

@samFarrellDay The script below generates a data.table with missing values and Unicode characters.
One observation: the Unicode characters are converted and displayed correctly only if they are defined inside the data.table environment.

# generate a data table containing NA values
require(data.table)
L = 1000L
x = list(a = sample(c(runif(L, -1L, 1L), rep(NA, L)), L)
       , b = sample(c(rnorm(L, 1L, 3L), rep(NA, L %/% 2L)), L) 
       , c = sample(rep(1:2, each = 2L), L, replace = TRUE))
dt = as.data.table(x)

# convert column "c" to Unicode characters
dt[, c := ifelse(c == 1L, '25 \u03BCL', '50 \u03BCL')]

# rename dt
setnames(dt, c('Treat \u03B1', 'Treat \u03B2', 'Sample')) 

> dt
          Treat a     Treat ß Sample
   1:          NA          NA  50 µL
   2:          NA          NA  50 µL
   3: -0.86576094        1.12  50 µL
   4:          NA          NA  50 µL

# obs: names(dt) reads Greek "alpha" ('\u03B1') as Latin character "a" 

The script converted the "c" vector from list x to Unicode inside the data.table. If I had done this in list x and then converted the list with as.data.table, the characters would not have been read as Unicode.
Example:

# alternative

# Unicode Greek letters
greek = c('\u03B1', '\u03B2', '\u03B3', '\u03B4', '\u03B5', '\u03B6', '\u03B7', '\u03B8', '\u03B9',
           '\u03BA', '\u03BB', '\u03BC', '\u03BD', '\u03BE', '\u03BF', '\u03C0', '\u03C1', '\u03C3',
           '\u03C4', '\u03C5', '\u03C6', '\u03C7', '\u03C8', '\u03C9')

# generate list with missing values and Unicode characters
L = 20L
x = list(
         a = sample(c(runif(L, -1L, 1L), rep(NA, L)), L)
       , b = sample(c(rnorm(L, 1L, 3L), rep(NA, L %/% 2L)), L) 
       , c = sample(
                     c(replicate(L, paste0(sample(c(greek, letters, 1:9)
                                         ,  size = 4L, replace = TRUE), collapse = ''))
                                , rep(NA, times = L)) , size = L)
  )

# convert list to data.table
dt = as.data.table(x)

> dt
               a          b                   c
 1:  0.706300090 -0.2082637                <NA>
 2:           NA -1.4747307                <NA>
 3:           NA         NA      <U+03B7>o4<U+03C1>9  <--- not read as Greek letters!
 4: -0.855431452 -0.8188787                <NA>
 5: -0.443747398  2.7301625                <NA>
 6:           NA         NA               2twzz

Thank you!
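
A note on the `<U+03B7>` output: as far as I understand, this is the escape R uses when printing a character that the session's native encoding cannot represent, so it may be a display issue rather than a sign that the data were changed. A minimal way to check is to inspect the code points instead of the printed form. The string below is rebuilt by hand from row 3 of the output and is only an assumption about what that cell contains.

```r
# Rebuild the string shown in row 3: eta, "o", "4", rho, "9"
s <- paste0(c("\u03B7", "o", "4", "\u03C1", "9"), collapse = "")

# Inspect the code points rather than relying on the printed form
codePoints <- utf8ToInt(enc2utf8(s))
nChars     <- nchar(s, type = "chars")

# Round trip: rebuilding the string from its code points gives the same text
roundTrip  <- intToUtf8(codePoints)
sameString <- identical(enc2utf8(s), roundTrip)

print(codePoints)
print(sameString)
```

If the code points are 951 and 961 for the first and fourth characters, the Greek letters are stored intact and only the console rendering differs.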
