Discuss ways to define suggestions #2246

SilvaQ · 2022-01-10T04:49:42Z

I have my own collection of misspelled words. When I am 100% sure that a word is misspelled and 100% sure which word it should be, how can I explicitly designate that correction through the Trie dictionary, rather than by traversing the binary tree?

For example:

Wuman -> woman
Goode -> good

I'm using these examples only to describe the problem; the words themselves are not important.

Is there a way to write something like this:

[adam,adem]

so that when the word "adam" is encountered and flagged as an error, the suggestion is "adem"?

Jason3S · 2022-01-10T07:32:36Z
Maintainer

I think this is a great thing to support!

This is a duplicate of #1416

There is a non-ideal work-around.

You can make a word list that looks like this:

!wuman
woman
!goode
good
!adam
adem

Adding ! to the beginning of the word will forbid it.
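
A minimal sketch of how such a list could be read, assuming one entry per line and that a leading `!` marks a forbidden word. The function name `parse_word_list` is hypothetical and this is not cspell's own loader. It also shows why the work-around is non-ideal: nothing in the list ties `wuman` to `woman`.

```python
def parse_word_list(text):
    """Split a word list into allowed and forbidden words.

    Assumption: one entry per line; a leading "!" forbids the word.
    """
    allowed, forbidden = set(), set()
    for line in text.splitlines():
        entry = line.strip()
        if not entry:
            continue  # skip blank lines
        if entry.startswith("!"):
            forbidden.add(entry[1:])
        else:
            allowed.add(entry)
    return allowed, forbidden


WORD_LIST = """\
!wuman
woman
!goode
good
!adam
adem
"""

allowed, forbidden = parse_word_list(WORD_LIST)
print(sorted(forbidden))  # ['adam', 'goode', 'wuman']
print(sorted(allowed))    # ['adem', 'good', 'woman']
```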

SilvaQ · 2022-01-10T07:45:03Z
Author

> I think this is a great thing to support!
>
> This is a duplicate of #1416
>
> There is a non-ideal work-around.
>
> You can make a word list that looks like this:
>
> !wuman
> woman
> !goode
> good
> !adam
> adem
>
> Adding ! to the beginning of the word will forbid it.

I've thought about flagWords, but it relies on the suggestion engine, and my one-to-one target is sometimes not the first suggestion, or is not suggested at all, especially when the misspelled word and the intended word have different lengths.

For example:

thx -> thanks
ext -> extra

and so on.

I used to handle one-to-one corrections with a map: I would check the map first and pass only the remaining words to the suggestion system.

Now everything goes through the suggestion system, and there is no option to do a one-to-one lookup first.

Could we consider providing a one-to-one map that is checked first, with the suggestion system handling the rest?
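
A rough sketch of the "map first, suggestions second" flow described here. The names `CORRECTIONS`, `DICTIONARY`, and `suggest` are hypothetical, and Python's `difflib.get_close_matches` merely stands in for a similarity-based suggestion system; it is not how cspell ranks suggestions.

```python
import difflib

# Hypothetical one-to-one map: misspelling -> the single intended word.
CORRECTIONS = {"thx": "thanks", "ext": "extra"}

# Hypothetical dictionary used by the fallback suggester.
DICTIONARY = ["thanks", "extra", "woman", "good"]


def suggest(word, limit=3):
    """Return the mapped correction if one exists, else similarity-based guesses."""
    key = word.lower()
    exact = CORRECTIONS.get(key)
    if exact is not None:
        return [exact]  # the one-to-one entry always wins
    return difflib.get_close_matches(key, DICTIONARY, n=limit)


print(suggest("thx"))  # ['thanks']
print(suggest("wuman"))
```

The map lookup answers `thx` even though `thx` and `thanks` are too dissimilar for the fallback to find on its own.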

Jason3S · 2022-01-10T08:33:11Z
Maintainer

I agree with you. Sometimes the words are not even similar, so they would not be suggested.

Jason3S · 2022-01-10T09:46:53Z
Maintainer

@SilvaQ,

Do you care more about the dictionary file format or about supporting suggestions?

SilvaQ · 2022-01-11T03:23:51Z
Author

@Jason3S Both.
Dictionary support makes more sense to me, but I want to be able to use both dictionaries and configuration-based customization.

Jason3S · 2022-01-11T08:25:34Z
Maintainer

As a first pass, I was thinking of a configuration-based approach before changing the dictionary format.

I was assuming the number of suggestion sets would be much smaller than the number of words, so a simple human-readable format would be just fine.

SilvaQ · 2022-01-11T08:35:09Z
Author

In my case, some words cannot appear on the Internet because they are marked as sensitive or banned.
It is also the nature of my language that I have 400,000 proofread words and 250,000 frequently miswritten words.

For me the Trie is less an algorithmic consideration and more a security consideration,
because with it I can provide services without worrying about security issues caused by the words themselves.

For personal local use that is not available to the general public, or for a small number of one-to-one suggestions, I would consider configuration.

That is why I say both, and why I prefer to provide services through dictionaries.

Jason3S · 2022-01-11T09:17:22Z
Maintainer

So you have a lot of "sensitive" words to store. I would consider the trie format more a form of accidental obfuscation than encryption.

In any case, I had been thinking about an efficient way to attach attributes to words. Suggestions are one kind of attribute, as is word usage frequency (which can be used to sort suggestions).

The idea was a separate section in the trie file.

For suggestions, it might be like this:

word id   suggestion word ids
1453      234, 345, 212, 334
954       234, 873, 372, 232

A word id is the collection of branches taken to walk the trie to make the word.

SilvaQ · 2022-01-11T09:28:17Z
Author

I think your idea is brilliant,
but I don't think it's a good idea to put words and attributes together in one Trie structure. That will make the dictionary bulky and make the logic grow larger and larger in the future.

Attributes are also not like words, so I think we should use more direct storage, such as a hash table.

But given the Trie's O(k) lookup, I think it might be fun to do it with a Trie that keeps words and attributes separate, and it could still be unified.

Jason3S · 2022-01-11T09:34:54Z
Maintainer

Example Word Id:

Trie from cspell-trie README.md

Offset  Output
------- --------
        TrieXv1
        base=10
0       *
1       d,r
2       g
3       n2
4       *e1,i3,s
5       k4
6       l5
7       a6
8       t7,w7

word    id segments    id in bits
walk    1 0 0 0        (1)1000 -> 24
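
The word id above can be reproduced with a short sketch. It rests on two assumptions that are my reading of the listing rather than a statement of the file format: a letter without a trailing number points at offset 0, and the root is the last node (offset 8). The packing step simply mirrors the `(1)1000 -> 24` example, so it only handles segments that fit in one bit. The names `NODES`, `id_segments`, and `pack_bits` are hypothetical.

```python
# offset -> (is_end_of_word, ordered list of (letter, child_offset))
NODES = {
    0: (True, []),
    1: (False, [("d", 0), ("r", 0)]),
    2: (False, [("g", 0)]),
    3: (False, [("n", 2)]),
    4: (True, [("e", 1), ("i", 3), ("s", 0)]),
    5: (False, [("k", 4)]),
    6: (False, [("l", 5)]),
    7: (False, [("a", 6)]),
    8: (False, [("t", 7), ("w", 7)]),
}
ROOT = 8  # assumption: the root is the last node written


def id_segments(word):
    """Return the index of the branch taken at each node, or None if not a word."""
    node, segments = ROOT, []
    for letter in word:
        for index, (candidate, child) in enumerate(NODES[node][1]):
            if candidate == letter:
                segments.append(index)
                node = child
                break
        else:
            return None  # no branch for this letter
    return segments if NODES[node][0] else None


def pack_bits(segments):
    """Pack one bit per segment behind a leading marker bit (as in the example)."""
    if any(segment > 1 for segment in segments):
        raise ValueError("this sketch only packs segments that fit in one bit")
    value = 1  # the "(1)" marker keeps leading zeros significant
    for segment in segments:
        value = (value << 1) | segment
    return value


print(id_segments("walk"))             # [1, 0, 0, 0]
print(pack_bits(id_segments("walk")))  # 24
```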

Jason3S · 2022-01-11T09:41:18Z
Maintainer

> I think your idea is brilliant, but I don't think it's a good idea to put words and attributes together in one Trie structure. That will make the dictionary bulky and make the logic grow larger and larger in the future.
>
> Attributes are also not like words, so I think we should use more direct storage, such as a hash table.
>
> But given the Trie's O(k) lookup, I think it might be fun to do it with a Trie that keeps words and attributes separate, and it could still be unified.

I was talking about the file format. In memory, attributes would be expanded into a hash table for speed.

Suggestions or word frequency are just attributes on a word. They should not be in the Trie structure, but exist in parallel.
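
As a sketch of "attributes in parallel, expanded into a hash table": the section below is parsed into a plain `dict` keyed by word id, while the Trie itself is left untouched. The row layout (a word id, a tab, then comma-separated suggestion ids) is borrowed from the example table earlier in the thread and is only an assumption about what such a section might look like; `load_suggestion_table` is a hypothetical name.

```python
# Hypothetical attribute section, stored beside the Trie rather than inside it.
SUGGESTION_SECTION = "1453\t234, 345, 212, 334\n954\t234, 873, 372, 232\n"


def load_suggestion_table(text):
    """Expand the on-disk rows into a dict: word id -> list of suggestion word ids."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank rows
        word_id, _, rest = line.partition("\t")
        table[int(word_id)] = [int(part) for part in rest.split(",")]
    return table


suggestions_by_id = load_suggestion_table(SUGGESTION_SECTION)
print(suggestions_by_id[954])      # [234, 873, 372, 232]
print(suggestions_by_id.get(24))   # None: this word has no suggestions attribute
```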

SilvaQ · 2022-01-11T09:42:30Z
Author

I thought about this yesterday while working on my one-to-one suggestions, but it never seemed very clean to me.

When we have several different types of properties, the relationship becomes many-to-many. If we design it this way, will we end up parsing several different special separator symbols? Getting that wrong could even affect which suggestions are selected.

I do not yet have a good understanding of the project's design, and I could not find developer documentation, so I read the code to get a general picture.

My only worry now is that my incomplete understanding of the project's current structure might lead your good suggestions astray.

Could we put together a rough draft and try it out together to see whether it works the way we want?

SilvaQ · 2022-01-11T09:48:49Z
Author

Would it be cleaner to set up two tries, writing the extra data into the file the way the current Trie header is written, reading it into memory, and then converting it to another structure, rather than using one trie and cramming everything in?

Something like this:

Offset  Output
------- --------
        TrieXv1
        base=10

          /** This is where all the attributes, categories, and relationships go */


0       *
1       d,r
2       g
3       n2
4       *e1,i3,s
5       k4
6       l5
7       a6
8       t7,w7

SilvaQ · 2022-01-11T10:02:57Z
Author

If you do use header text to describe attribute information, I recommend following the TOML specification.

SilvaQ · 2022-01-11T10:16:23Z
Author

I think we understand "attributes" a bit differently. I don't think one-to-one suggestions or word usage frequency can be treated as attributes; they need special treatment and must be bundled with the source word. By attributes I mean a small number of different kinds of tags on a word:

This word is a verb
There are suggestions for this word
This word has three syllables
This word has n characters

etc.

The Trie should point to the IDs of these properties, not store them. That is what I mean.

SilvaQ · 2022-01-11T10:24:15Z
Author

Can the project provide developer documentation?

It would help us learn and, where we can, implement some features ourselves.

For my part, I could implement it in other languages, such as Python or Golang.

Jason3S · 2022-01-11T10:52:17Z
Maintainer

Word attributes cannot be in the Trie because they would explode the size of the Trie. All attributes need to be in a separate data structure. This is because the Trie is not a pure Trie. On the outside, it looks like a standard Trie. But to save space, all words that share the same suffix set use the same trie node.

In the Trie, the words talking and ring share the same sub-trie ing. Because of this, adding attributes into the Trie leaf nodes would prevent any sharing of sub-tries.
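
The effect can be sketched with a toy trie of nested dicts. This is only an illustration of suffix sharing in general, not cspell's implementation; `build_trie`, `share_subtries`, and the `attr:` leaf tag are hypothetical names.

```python
def build_trie(words, tag_leaves=False):
    """Build a plain trie of nested dicts; "$" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        # Optionally hang a per-word attribute under the end-of-word marker.
        node["$"] = {f"attr:{word}": {}} if tag_leaves else {}
    return root


def share_subtries(node, registry):
    """Return one canonical object per distinct sub-trie (built bottom-up)."""
    canonical = {k: share_subtries(child, registry) for k, child in node.items()}
    key = tuple(sorted((k, id(child)) for k, child in canonical.items()))
    return registry.setdefault(key, canonical)


def walk(node, prefix):
    for letter in prefix:
        node = node[letter]
    return node


plain_nodes, tagged_nodes = {}, {}
plain = share_subtries(build_trie(["talking", "ring"]), plain_nodes)
tagged = share_subtries(build_trie(["talking", "ring"], tag_leaves=True), tagged_nodes)

print(walk(plain, "talk") is walk(plain, "r"))    # True: one shared "ing" sub-trie
print(walk(tagged, "talk") is walk(tagged, "r"))  # False: the tagged leaves differ
print(len(plain_nodes) < len(tagged_nodes))       # True: tagging costs extra nodes
```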

SilvaQ · 2022-01-12T02:31:14Z
Author

I think sharing suffixes is a great optimization, but it may mean that other Trie features do not work as well.
Based on the characteristics of the Trie discussed above, I thought about how to do this and drew a diagram. The main idea is to use the path ID as an aid to solve the problem.

But in the end I was a little hesitant about whether it is wise for us to tinker with the Trie structure. After all, the Trie structure exists to serve suggestions. If we force it to do other things, we also put constraints on ourselves.

I think our RFC is more powerful, and it's probably smarter for us to focus on that.

Jason3S · 2022-01-12T06:57:04Z
Maintainer

Your diagram is an excellent example of the indexing.

Jason3S · 2022-01-12T06:58:22Z
Maintainer

Did you mean this?

That is, adding words like: hte@the?

SilvaQ · 2022-01-12T07:09:39Z
Author

I mean let's not mess with the Trie; let's do it through the configuration file, like this:

In the following picture I want to split the path of the suggested word with the @ sign.

hte --> e -> e,curPath@wordLen@suggestionWordPath(something like dec(010101))
If we find e and its path matches currentPath, we assume it has a suggestion, and we find that suggestion by following the suggestion path. For example, with a suggestion length of 3 and a decimal value of 6, the path is 110 (binary 110 = decimal 6).
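
The arithmetic in that example can be checked with a few lines. This is one reading of the proposal, assuming the stored length exists so that leading zeros in the path are not lost; `path_from_decimal` is a hypothetical helper.

```python
def path_from_decimal(value, length):
    """Render `value` as `length` binary branch choices, keeping leading zeros."""
    if value < 0 or value >= 1 << length:
        raise ValueError("value does not fit in the given path length")
    return format(value, f"0{length}b")


print(path_from_decimal(6, 3))  # 110
print(int("110", 2))            # 6
print(path_from_decimal(1, 3))  # 001 (the length keeps the two leading zeros)
```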

Jason3S · 2022-01-12T07:29:11Z
Maintainer

For suggestions, I was planning on using a word list like this:

Wikipedia:Lists of common misspellings/For machines - Wikipedia

Notice that there can be multiple suggestions for a word.

It would be possible to take:

archeaologist->archeologist, archaeologist

and turn it into:

archeaologist@archeologist
archeaologist@archaeologist

This is logically the same as:

aa@ba
aa@ca

Which looks like:

a - a - @ - b - a
          \ c /

There are a few challenges:

Preserving word order is not possible (this assumes the suggestions are listed in order of preference).
Large word lists won't be able to take advantage of sharing sub-tries (sharing is the main reason the trie is smaller than the original list).
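
A small sketch of the expansion and of the ordering problem. It assumes the `word->suggestion, suggestion` line shape shown above; `expand_entry` is a hypothetical name, and sorting stands in for the fact that a set of strings carries no preference order of its own.

```python
def expand_entry(line):
    """Turn 'word->s1, s2' into ['word@s1', 'word@s2'], keeping the listed order."""
    word, _, rest = line.partition("->")
    return [f"{word.strip()}@{s.strip()}" for s in rest.split(",") if s.strip()]


entries = expand_entry("archeaologist->archeologist, archaeologist")
print(entries)
# ['archeaologist@archeologist', 'archeaologist@archaeologist']

# Once the entries are just members of a word set, the original preference
# order is gone; enumerating them alphabetically swaps the two suggestions.
print(sorted(entries))
# ['archeaologist@archaeologist', 'archeaologist@archeologist']
```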

Jason3S · 2022-01-12T10:02:21Z
Maintainer

Using flagWords in this way is possible:

{
  "flagWords": [
    ["whitelist", "allowlist"],
    ["blacklist", "denylist"],
    // other words
  ]
}
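
A sketch of how a consumer might interpret entries of that shape, assuming each pair means "flagged word, then its suggestion" and that plain strings are flagged without a suggestion. The function `split_flag_words` and the plain entry `"hte"` are hypothetical; this says nothing about what cspell itself accepts.

```python
def split_flag_words(flag_words):
    """Separate flagged words from their optional one-to-one suggestions."""
    flagged, suggestions = set(), {}
    for entry in flag_words:
        if isinstance(entry, str):
            flagged.add(entry)  # flagged, no suggestion attached
        else:
            word, *replacements = entry
            flagged.add(word)
            suggestions[word] = list(replacements)
    return flagged, suggestions


config = {
    "flagWords": [
        ["whitelist", "allowlist"],
        ["blacklist", "denylist"],
        "hte",  # hypothetical plain entry
    ]
}

flagged, suggestions = split_flag_words(config["flagWords"])
print(sorted(flagged))           # ['blacklist', 'hte', 'whitelist']
print(suggestions["whitelist"])  # ['allowlist']
```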

I do not plan on supporting the use of RegExp replacements. It would make spell checking very very slow. That is more of a "find-and-replace" type of operation. For large word lists, that would be very expensive.

In the longer term, I would rather have suggestion lists act like dictionaries. Dictionaries can be turned on / off or even replaced by defining a new dictionary with the same name.

My plan was to go with simple text-based suggestion lists that can be gzipped (.gz).
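
Reading such a list takes only a few lines. The sketch below assumes one `word->suggestion, suggestion` entry per line, borrowed from the misspelling list format mentioned earlier; the actual file format is not decided in this thread, and `read_suggestion_list` and the file name are hypothetical.

```python
import gzip
import os
import tempfile


def read_suggestion_list(path):
    """Read a gzipped text list into a dict: word -> ordered list of suggestions."""
    table = {}
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            word, separator, rest = line.strip().partition("->")
            if not separator:
                continue  # skip blank or malformed lines
            table[word.strip()] = [s.strip() for s in rest.split(",") if s.strip()]
    return table


# Round trip through a temporary .gz file.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "suggestions.txt.gz")
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write("archeaologist->archeologist, archaeologist\nhte->the\n")
    table = read_suggestion_list(path)

print(table["hte"])  # ['the']
```

Because the list stays plain text, the order of suggestions on each line survives, which the trie encoding could not guarantee.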

Jason3S · 2022-01-12T10:03:24Z
Maintainer

See #2247

2 replies

SilvaQ Jan 12, 2022
Author

Good choice, I agree with the proposal.

SilvaQ Jan 12, 2022
Author

I also hope that VS Code-style regular expression support can be added as well.
