Implementation options of task to add weight #36

vtyushkevich · 2023-09-11T18:52:33Z

vtyushkevich
Sep 11, 2023
Collaborator

I have two ideas for how we can improve the algorithm by adding weights to tokens.

1. We can compute how often each word appears in the list of texts passed to the find_similar function. Then we can sort all words by frequency and give them float scores, for example from 0 to 1 (from the most frequent to the rarest), and use those scores in find_similar when calculating the final similarity score.
2. Alternatively, there is an interesting Python library, wordfreq (https://pypi.org/project/wordfreq/), so we can get the global frequency of each word in the text list and use it in find_similar too.

@quillcraftsman What do you think about it?
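
A minimal sketch of the first idea, to make the 0 to 1 scoring concrete. It assumes lowercase whitespace tokenization and a linear mapping from frequency rank to score; `frequency_weights` is a hypothetical helper and not part of find_similar.

```python
from collections import Counter


def frequency_weights(texts):
    """Map each word to a score in [0, 1]: most frequent -> 0.0, rarest -> 1.0.

    Hypothetical helper: assumes lowercase whitespace tokenization and scores
    words by frequency rank, so words with equal counts share a score.
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    # Distinct frequency values, highest first; the list index is the rank.
    distinct = sorted(set(counts.values()), reverse=True)
    if len(distinct) <= 1:
        # All words are equally frequent, so none is rarer than another.
        return {word: 1.0 for word in counts}
    rank = {freq: i for i, freq in enumerate(distinct)}
    top = len(distinct) - 1
    return {word: rank[freq] / top for word, freq in counts.items()}


weights = frequency_weights(['one two', 'one three', 'one two four'])
```

Here 'one' appears in every text and gets the lowest score, while 'three' and 'four' appear once and get the highest. For the second idea, the same mapping could probably be fed from wordfreq's `word_frequency(word, lang)` instead of local counts.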

quillcraftsman · 2023-09-12T09:41:48Z

quillcraftsman
Sep 12, 2023
Maintainer

wordfreq looks interesting. Maybe we can use it by default for some language (with an option to use these coefficients or not). But there is a question about custom coefficients: how can we set them for a specific task, and can we do that with wordfreq? I noticed that wordfreq uses special data files for other languages, so maybe we could use our own custom file.
Anyway, I think we can try wordfreq and check the difference in the result report. Maybe this library will be really useful later (maybe not :))

vtyushkevich · 2023-09-13T06:19:28Z

vtyushkevich
Sep 13, 2023
Collaborator Author

Interesting article
https://www.newscatcherapi.com/blog/ultimate-guide-to-text-similarity-with-python

quillcraftsman · 2023-09-13T06:55:02Z

quillcraftsman
Sep 13, 2023
Maintainer

Yes, got it. I need time to read it closely.

3 replies

vtyushkevich Sep 21, 2023
Collaborator Author

To start with something, I suggest passing the find_similar function a dict that contains important words with weight 0 or 1. When the function runs, it counts the number of important words in each text and saves that count on the TokenText instance, like cos. After that we can use it to sort the resulting list of texts.
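
A rough sketch of that first stage, assuming lowercase whitespace tokenization. `count_important` is a hypothetical helper, and the sort works on plain strings rather than real TokenText instances.

```python
def count_important(text, important):
    """Count the tokens of `text` that are flagged as important (weight 1).

    Hypothetical helper: assumes lowercase whitespace tokenization and a
    dict of word -> 0 or 1, as suggested above.
    """
    return sum(1 for token in text.lower().split() if important.get(token, 0) == 1)


important = {'one': 1, 'two': 1, 'three': 0}
texts = ['five six', 'one three', 'one two']
# Sort so that texts containing more important words come first.
ranked = sorted(texts, key=lambda t: count_important(t, important), reverse=True)
```

In the library itself the count would presumably live on the TokenText instance next to cos, and the sort could use it as a tie-breaker or as the primary key.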

quillcraftsman Sep 21, 2023
Maintainer

@vtyushkevich, only 1 or 0?

vtyushkevich Sep 22, 2023
Collaborator Author

Maybe at the first stage, just to try it.
Later it would be possible to add the ability to define floating-point weights from 0 to 1, for example 1 for the most important word, 0.8 for a less important one, etc.
I am thinking about how this will be exposed to the user on the frontend. For example, should the user only check the important words (a 0 or 1 system), rank all words from most to least important, or assign arbitrary integer weights to words?

quillcraftsman · 2023-09-22T08:59:18Z

quillcraftsman
Sep 22, 2023
Maintainer

Got it. If a text is very long and all of its words are important, we will end up duplicating most of its words in the list of important words. In that case it may be better to set the important words on the TokenText object. But if we have only a few important words and a big text, it will be inconvenient to convert the str text to TokenText.

Maybe we can support both of these options:

- send the important words to the find_similar function
- send the importance coefficient in the TokenText instance

???

1 reply

vtyushkevich Sep 22, 2023
Collaborator Author

If I understand correctly:
we send find_similar a list of TokenText instances, where each instance is one important word with some coefficient,
then we use it in the calculations.
Please correct me if this is wrong.

quillcraftsman · 2023-09-22T16:18:49Z

quillcraftsman
Sep 22, 2023
Maintainer

Variant 1:

text = 'one two three'
texts = [TokenText('one', important=1.0), TokenText('two', important=0.0), TokenText('three')]
result = find_similar(text, texts)

Variant 2:

text = 'one two three'
texts = ['one', 'two', 'three']
important = [{'one': 1.0}, {'two': 0.0}]
result = find_similar(text, texts, important=important)

Is this the variant you were talking about above?

Theoretically we can combine both variants. If you meant something else, please show an example.
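
One way the two variants might be combined, as a sketch only. `WeightedText` is a hypothetical stand-in for a TokenText that carries an `important` value, `resolve_importance` is a hypothetical helper, and both the precedence order and the default of 1.0 are assumptions. The `important` argument is taken as a single dict rather than a list of dicts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WeightedText:
    """Hypothetical stand-in for a TokenText that carries an `important` value."""
    text: str
    important: Optional[float] = None  # None means "not set on the instance"


def resolve_importance(texts, important=None, default=1.0):
    """Return (text, coefficient) pairs from a mix of str and WeightedText items.

    Assumed precedence: value set on the instance, then the `important`
    mapping passed to the call, then `default`.
    """
    important = important or {}
    resolved = []
    for item in texts:
        if isinstance(item, WeightedText) and item.important is not None:
            resolved.append((item.text, item.important))
        else:
            text = item.text if isinstance(item, WeightedText) else item
            resolved.append((text, important.get(text, default)))
    return resolved


pairs = resolve_importance(
    [WeightedText('one', important=1.0), 'two', WeightedText('three')],
    important={'two': 0.0},
)
```

With this precedence, a coefficient set on an instance overrides the call-level mapping, so callers can mix plain strings and pre-built objects in one list.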

vtyushkevich · 2023-09-29T03:46:29Z

vtyushkevich
Sep 29, 2023
Collaborator Author

My variant was closer to variant 2, something like this:

text = 'one two three'
texts = ['one two', 'one three', 'five six']
important = {'one': 1.0, 'two': 1.0}
result = find_similar(text, texts, important=important)
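
To illustrate how such an `important` dict could influence the score, here is a self-contained sketch. It assumes lowercase whitespace tokenization and a bag-of-words cosine in which each token count is scaled by `1 + important[token]`; `rank_texts` is a hypothetical stand-in and does not reflect how find_similar actually computes its score.

```python
import math
from collections import Counter


def weighted_vector(text, important):
    """Bag-of-words vector with counts scaled by 1 + importance (an assumption)."""
    counts = Counter(text.lower().split())
    return {tok: n * (1.0 + important.get(tok, 0.0)) for tok, n in counts.items()}


def weighted_cos(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(tok, 0.0) for tok, value in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_texts(text, texts, important=None):
    """Return (score, text) pairs sorted from most to least similar."""
    important = important or {}
    query = weighted_vector(text, important)
    scored = [(weighted_cos(query, weighted_vector(t, important)), t) for t in texts]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


result = rank_texts(
    'one two three',
    ['one two', 'one three', 'five six'],
    important={'one': 1.0, 'two': 1.0},
)
```

With these inputs 'one two' ranks first because both of its tokens are marked important, 'one three' comes second, and 'five six' scores zero since it shares no tokens with the query.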

quillcraftsman · 2023-09-29T15:12:58Z

quillcraftsman
Sep 29, 2023
Maintainer

Okay, understood. As I said, we can implement this one or both. Let's start with this one.
