Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can the station registry accept UTF-8 in a station description? #7

Open
tkeffer opened this issue Oct 1, 2022 · 5 comments
Open

Can the station registry accept UTF-8 in a station description? #7

tkeffer opened this issue Oct 1, 2022 · 5 comments
Assignees

Comments

@tkeffer
Copy link
Contributor

tkeffer commented Oct 1, 2022

In /var/log/nginx/error.log, I am seeing a lot of errors of the sort

2022/10/01 12:05:29 [error] 940#0: *105063 FastCGI sent in stderr: "DBD::mysql::st execute failed: Incorrect string value: '\xC4\x81zene...'
 for column 'description' at row 1 at /var/www/html/register/register.cgi line 429" while reading response header from upstream, client: 62.
85.109.69, server: weewx.com, request: "GET /register/register.cgi?station_url=https%3A%2F%2Fwww.wunderground.com%2Fpersonal-weather-station
%2Fdashboard%3FID=IRGASPIL3&description=Atg%C4%81zene%2C+R%C4%ABga%2C+Latvija&latitude=56.9172&longitude=24.0822&station_type=WMR200&station
_model=WMR200A&python_info=3.10.6&platform_info=Linux-5.15.0-1015-raspi-aarch64-with-glibc2.35&weewx_info=4.8.0 HTTP/1.1", upstream: "fastcg
i://unix:/var/run/fcgiwrap.socket:", host: "weewx.com"

What they all seem to have in common is the use of UTF-8 in the field description.

Perl is a mystery to me, so I can't tell if it's rejecting UTF-8 characters.

@matthewwall
Copy link
Contributor

i just tried an alter on the description field to make it utf8. need to do more research to see what cause this behavior to start, but i suspect an upgrade to mysql would have done it.

so far i do not see more messages like this in error.log, but we'll see...

@matthewwall
Copy link
Contributor

here is the table before and after, and the alter command.

alter table `stations` change `description` `description` varchar(255) character set utf8mb4 default NULL;
mysql> show create table stations \G
*************************** 1. row ***************************
       Table: stations
Create Table: CREATE TABLE `stations` (
  `station_url` varchar(255) NOT NULL,
  `description` varchar(255) DEFAULT NULL,
  `latitude` float DEFAULT NULL,
  `longitude` float DEFAULT NULL,
  `station_type` varchar(64) DEFAULT NULL,
  `station_model` varchar(128) DEFAULT NULL,
  `weewx_info` varchar(64) DEFAULT NULL,
  `python_info` varchar(64) DEFAULT NULL,
  `platform_info` varchar(128) DEFAULT NULL,
  `last_addr` varchar(16) DEFAULT NULL,
  `last_seen` int DEFAULT NULL,
  UNIQUE KEY `index_url` (`station_url`),
  KEY `index_addr` (`last_addr`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
mysql> show create table stations \G
*************************** 1. row ***************************
       Table: stations
Create Table: CREATE TABLE `stations` (
  `station_url` varchar(255) NOT NULL,
  `description` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL,
  `latitude` float DEFAULT NULL,
  `longitude` float DEFAULT NULL,
  `station_type` varchar(64) DEFAULT NULL,
  `station_model` varchar(128) DEFAULT NULL,
  `weewx_info` varchar(64) DEFAULT NULL,
  `python_info` varchar(64) DEFAULT NULL,
  `platform_info` varchar(128) DEFAULT NULL,
  `last_addr` varchar(16) DEFAULT NULL,
  `last_seen` int DEFAULT NULL,
  UNIQUE KEY `index_url` (`station_url`),
  KEY `index_addr` (`last_addr`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1

@matthewwall
Copy link
Contributor

actually, that latin1 is kind of obvious. when did that appear? pretty sure it was not part of the original table creation. so perhaps a mysql upgrade added that to make it explicit, thus breaking any inserts that contain utf8?

sorry, but character encodings are one of my many weaknesses

@matthewwall
Copy link
Contributor

the latin1 is there apparently because i did not specify a characterset when i originally created the tables eons ago, so it defaulted to latin1. apparently mysql 8 (running on weewx.com as of this writing nov 2022) defaults to utf8 (not sure whether that is 3-byte or 4-byte).

i'm not crazy about altering everything in the database. description should obviously accept non-latin characters, but should any other fields?

@tkeffer
Copy link
Contributor Author

tkeffer commented Nov 4, 2022

Interesting.

Internally, I'm sure MySQL stores it as a byte string. However, it appears that the database engine is also enforcing character type safety. That is, if you declare the field as Latin-1, then try to store something that uses non-Latin-1 characters, it will declare an error.

When it comes time to print the field out, the MySQL client will use the encoding to figure out what code point to use.

I believe that UTF-8 domain names are valid these days, although most DNS servers don't support them. To future proof things, I would use utf-8 encoding for station_url as well.

The other fields should be fine.

It's an ancient article, but I recommend Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets. It set me straight now 20 years ago!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants