Treat private-use characters like non-printable characters for escaping #4015

Open
4 tasks
RoelN opened this issue Jan 8, 2025 · 7 comments


RoelN commented Jan 8, 2025


Example in playground

The following code:

$one: \31;
$pua: \e000;

div::before {
  content: unquote("\"#{$one}\"");
}

div::after {
  content: unquote("\"#{$pua}\"");
}

results in the following CSS:

div::before {
  content: "\31 ";
}

div::after {
  content: "\e000";
}

Note the space added after \31. It is added when using the Unicode escape for the digit 1, but not when using the Unicode escape for characters in the Private Use Area (PUA). The latter is correct; there shouldn't be a trailing space.

Contributor

ntkme commented Jan 8, 2025

It seems to be a parser bug instead of a serializer bug: https://sass-lang.com/playground/#eJwzNHQoLU5VUCpOLC62Ki4pysxLV7LmUkm0UogxAAJjQ2suh5TUpNJ0BYikXk5qXnpJhoZKoiZMpLA0vyQVJADUlwTWh1tXErquJJCuZKAu3HqS0fUkg/SkAPWkAl2IS1cKuq4UTWsAkQNLNQ==?s=L1C1-L9C43

\31 is parsed as a string of length 4: it becomes the unquoted string \31 with an extra space at the end.

The proper behavior here should be that, at the parse phase, \31 is deserialized as the single-character string 1, and that it is not escaped at all during serialization.
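The decoding step proposed here can be sketched as follows. This is a minimal Python illustration of CSS escape decoding, not dart-sass's parser; the name decode_css_escapes is hypothetical, and edge cases such as a backslash directly before a newline are ignored.

```python
import re

# A backslash followed by 1-6 hex digits and one optional whitespace
# character, or a backslash followed by any other single character.
_ESCAPE = re.compile(r"\\(?:([0-9a-fA-F]{1,6})[ \t\n\r\f]?|(.))", re.DOTALL)


def decode_css_escapes(text: str) -> str:
    """Replace each CSS escape in `text` with the character it denotes."""

    def replace(match: re.Match) -> str:
        hex_digits, literal = match.group(1), match.group(2)
        if hex_digits is None:
            return literal
        cp = int(hex_digits, 16)
        # CSS maps NUL, surrogates, and out-of-range values to U+FFFD.
        if cp == 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            return "\ufffd"
        return chr(cp)

    return _ESCAPE.sub(replace, text)


print(decode_css_escapes(r"\31"))        # 1
print(len(decode_css_escapes(r"\31 ")))  # 1
```

Under this approach the trailing space is consumed as part of the escape, so it never becomes part of the string's value.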

Contributor

ntkme commented Jan 8, 2025

The likely root cause is at this line: https://github.com/sass/dart-sass/blob/9e6e3bfbd28fa07bd0df63cfcb85d2db9ef9b6c2/lib/src/parse/parser.dart#L472

There is special logic so that, when a string is parsed as an identifier, already-escaped ASCII digits 0-9 (\30 - \39) are explicitly parsed as \30 - \39 with a trailing space.

@nex3 I wonder why this special treatment is done during the parsing phase. Maybe because an escaped 0-9 indicates that this string must always be an identifier? Shouldn't the special escape for digits at the beginning of an identifier be applied at serialization, based on whether an identifier is being output or not?

Contributor

nex3 commented Jan 9, 2025

The issue here isn't that the space is being added after \31, it's that it's not being added after \e000. See Consuming an Escaped Code Point:

  • Otherwise, if codepoint is a non-printable code point, U+0009 CHARACTER TABULATION, U+000A LINE FEED, U+000D CARRIAGE RETURN, or U+000C FORM FEED; or if codepoint is a digit and the start flag is set:

    • Let code be the lowercase hexadecimal representation of codepoint, with no leading 0s.

    • Return "\" + code + " ".
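As a rough sketch of the rule quoted above, the following Python shows how a decoded code point might be re-emitted in canonical form. Only the hexadecimal branch comes from the quoted text. The branches that return the character unescaped and the final backslash fallback are assumptions about the rest of the algorithm, and the name canonical_escape is hypothetical. This is not dart-sass's code.

```python
import string


def is_non_printable(cp: int) -> bool:
    # CSS Syntax "non-printable code point": U+0000-0008, U+000B,
    # U+000E-001F, and U+007F.
    return cp <= 0x08 or cp == 0x0B or 0x0E <= cp <= 0x1F or cp == 0x7F


def is_name_start(cp: int) -> bool:
    # ASCII letter, underscore, or any non-ASCII code point.
    return cp >= 0x80 or chr(cp) in string.ascii_letters + "_"


def canonical_escape(cp: int, start: bool = False) -> str:
    """Return the canonical spelling of an escaped code point."""
    ch = chr(cp)
    if is_name_start(cp):
        return ch  # assumption: name-start code points stay unescaped
    if not start and (ch in string.digits or ch == "-"):
        return ch  # assumption: digits and hyphens are fine mid-name
    if (is_non_printable(cp) or ch in "\t\n\r\f"
            or (ch in string.digits and start)):
        # The quoted branch: lowercase hex, no leading zeros, then a space.
        return "\\" + format(cp, "x") + " "
    return "\\" + ch  # assumption: fallback is a plain backslash escape


print(repr(canonical_escape(0x31, start=True)))  # '\\31 '
print(repr(canonical_escape(0x31)))              # '1'
```

Under these assumptions, U+E000 counts as a name-start code point (it is non-ASCII), so it comes back as the literal character rather than as a hex escape.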

The space itself is part of the CSS syntax for escape codes (see § 4.3.7. Consume an escaped code point). We want to include this consistently in the canonicalized format of parsed identifiers so that equivalent identifiers are always equal, while also ensuring that there isn't weird behavior like the identifier form of 1a being one character longer than the identifier form of 1x.

Edit: Sorry, I was wrong: the Sass spec does not mandate a space after \e000 because it's not considered a "non-printable code point". Technically, according to the spec, the canonical form of \e000 should be the unescaped character itself (that is, the literal U+E000 PRIVATE USE AREA code point). I think that's not a desirable behavior, though; we should define private-use characters to be considered "non-printable" for this purpose. I'm going to move this to the spec repo accordingly.
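One way to picture the proposed change is as a wider test for "needs a hexadecimal escape". The sketch below is an assumption about how that could look, not spec text or dart-sass code. It uses Python's unicodedata module to detect private-use code points (general category Co), and the names needs_hex_escape and escape_if_needed are hypothetical.

```python
import unicodedata


def is_private_use(cp: int) -> bool:
    # General category "Co" covers U+E000-F8FF and planes 15 and 16.
    return unicodedata.category(chr(cp)) == "Co"


def needs_hex_escape(cp: int, private_use_counts: bool) -> bool:
    non_printable = (cp <= 0x08 or cp == 0x0B
                     or 0x0E <= cp <= 0x1F or cp == 0x7F)
    return non_printable or (private_use_counts and is_private_use(cp))


def escape_if_needed(cp: int, private_use_counts: bool) -> str:
    if needs_hex_escape(cp, private_use_counts):
        return "\\" + format(cp, "x") + " "
    return chr(cp)


# Current spec wording: the literal character is the canonical form.
print(repr(escape_if_needed(0xE000, private_use_counts=False)))
# Proposed behavior: a hex escape with the trailing space.
print(repr(escape_if_needed(0xE000, private_use_counts=True)))
```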

nex3 added the "bug" label Jan 9, 2025
nex3 self-assigned this Jan 9, 2025
nex3 removed the "bug" label Jan 9, 2025
nex3 transferred this issue from sass/dart-sass Jan 9, 2025
nex3 changed the title from "Trailing space added for Unicode values for numbers" to "Treat private-use characters like non-printable characters for escaping" Jan 9, 2025
Contributor

ntkme commented Jan 9, 2025

As far as I know, the space is optional and only required if the next character is a whitespace character or a hex digit?
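The CSS rule referred to here can be sketched as a serializer that emits the separator only when the following character would otherwise be read as part of the escape. This is an illustrative Python sketch with a hypothetical function name, not what Sass does today.

```python
import string


def hex_escape(cp: int, following: str = "") -> str:
    """Hex-escape `cp`, adding a space only if `following` needs it."""
    escape = "\\" + format(cp, "x")
    nxt = following[:1]
    # A hex digit would extend the escape; whitespace would be swallowed
    # as its terminator. In both cases an explicit space is required.
    if nxt and (nxt in string.hexdigits or nxt in " \t\n\r\f"):
        escape += " "
    return escape + following


print(hex_escape(0x31, "a"))  # \31 a
print(hex_escape(0x31, "x"))  # \31x
```

With this minimal form, 1a serializes one character longer than 1x, which is the inconsistency a fixed trailing space avoids.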

Contributor

nex3 commented Jan 9, 2025

That's right, but the way it's canonicalized is observable: the SassScript value of the identifier \31 is a four-character unquoted string containing [\, 3, 1, space]. Because of that, we want to make the canonical form as consistent as possible.

This is a downstream effect of the way we use the "unquoted string" datatype to represent not just identifiers but any CSS value we don't have a dedicated type for, including things like plain-CSS functions and so on. Where quoted strings just store their semantic values, unquoted strings store their syntactic values. This means that identifiers are stored escaped, so we need to be consistent about how they're escaped so we don't have weird issues where, for example, \@x, \40 x, and \40x are all treated as different values despite being semantically identical. This is what the Identifier Escapes proposal was all about.

Contributor

ntkme commented Jan 9, 2025

It seems to me that there is a limitation: we cannot clearly tell the difference between an unquoted identifier string and an unquoted non-identifier string, and that’s why we are just parsing it without decoding the escape sequence, to force it to be output as is.

My question is: why can't we always decode the escape sequence during the parsing stage (the example in question is not a Sass value from a JS script but a value written directly in the Sass source input)? In other words, parse \31 as the unquoted string 1 during the parse stage and later, during output serialization, print either 1, \31, or \31 with a trailing space, based on the context in which we are writing the CSS output?
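A sketch of what this suggestion could look like: keep the decoded value and choose the escaping at output time. The Python below is hypothetical and heavily simplified. It only handles a leading digit for identifiers and the quote and backslash characters for strings, and the function names are made up for illustration.

```python
import string


def serialize_as_identifier(value: str) -> str:
    # A CSS identifier cannot start with a digit, so a leading digit is
    # written as a hex escape followed by a space.
    if value and value[0] in string.digits:
        return "\\" + format(ord(value[0]), "x") + " " + value[1:]
    return value


def serialize_as_quoted_string(value: str) -> str:
    # Inside quotes a digit needs no escaping at all.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


decoded = "1"  # what \31 would decode to at parse time
print(repr(serialize_as_identifier(decoded)))     # '\\31 '
print(repr(serialize_as_quoted_string(decoded)))  # '"1"'
```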

Author

RoelN commented Jan 9, 2025

Thanks for looking into this. Please note that the space seems to be removed in CSS when concatenated with text: https://codepen.io/RoelN/pen/zxORvxV

But then again I'd expect the output of

$one: \31;

div::before {
  content: unquote("\"#{$one}#{$one}#{$one}\"");
}

to be

div::before {
  content: "\31\31\31 ";
}

(with or without the trailing space)

and not

div::before {
  content: "\31 \31 \31 ";
}
