feature: new tests added for tsne to expand test coverage #2229

yuejiaointel · 2024-12-17T18:18:24Z

Description

Added tests in sklearnex/manifold/tests/test_tsne.py to expand test coverage for the t-SNE algorithm.

PR completeness and readability

I have reviewed my changes thoroughly before submitting this pull request.
I have commented my code, particularly in hard-to-understand areas.
Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).

Testing

I have run it locally and tested the changes extensively.
All CI jobs are green or I have provided justification why they aren't.
I have extended testing suite if new functionality was introduced in this PR.

codecov · 2024-12-17T18:59:48Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Flag	Coverage Δ
github	`83.18% <ø> (ø)`

yuejiaointel · 2024-12-18T15:59:14Z

/intelci: run

ethanglaser · 2024-12-19T06:52:25Z

/intelci: run

yuejiaointel · 2025-01-06T23:01:52Z

/intelci: run

david-cortes-intel · 2025-01-07T08:08:58Z

It looks like we don't have any test here nor in daal4py that would be checking that the results from TSNE make sense beyond having the right shape and non-missingness.

Since there's a very particular dataset here for the last test, it'd be helpful to add other assertions there along the lines of checking that the embeddings end up making some points closer than others as would be expected given the input data.

yuejiaointel · 2025-01-08T01:28:07Z

Hi David,
Regarding the last comment: I think that is a good test to add! I spent some time thinking it through and added a logic check to the final test that evaluates the overlap of close neighbors. Here is a summary of the steps I implemented:

1. Compute a distance array where [i, j] is the Euclidean distance between points i and j in the original space, and do the same for the t-SNE embedding space.
2. Rank the distances for each point (wrt first column) in the original space, and likewise in the embedding space.
3. Take the top 5 neighbors of each point in the original and embedding spaces, and compute the fraction of them that are the same.
4. Take the mean of all fractions; it should represent how similar the original and embedding spaces are for the 5 closest points.
5. Check that the mean is > 0.6.

Let me know your thoughts on this approach or if you believe it could be improved further.
Thanks a lot :D
Yue
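
The steps in this comment could be sketched roughly as follows. This is an illustration only, not the code from the PR: the helper names (`pairwise_euclidean`, `top_k_neighbors`, `neighbor_overlap`) are hypothetical, and a scaled copy of the input stands in for a real TSNE embedding so the snippet needs nothing beyond NumPy.

```python
import numpy as np


def pairwise_euclidean(points):
    """Return the (n, n) matrix of Euclidean distances between rows."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff**2).sum(axis=-1))


def top_k_neighbors(dist, k):
    """Indices of the k nearest neighbors of each point, excluding itself."""
    d = dist.copy()
    np.fill_diagonal(d, np.inf)  # a point is never its own neighbor
    return np.argsort(d, axis=1, kind="stable")[:, :k]


def neighbor_overlap(original, embedding, k=5):
    """Mean fraction of top-k neighbors shared between the two spaces."""
    nn_orig = top_k_neighbors(pairwise_euclidean(original), k)
    nn_emb = top_k_neighbors(pairwise_euclidean(embedding), k)
    fractions = [len(set(a) & set(b)) / k for a, b in zip(nn_orig, nn_emb)]
    return float(np.mean(fractions))


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))

# Stand-in "embedding" with identical geometry: every neighbor is preserved.
overlap_same = neighbor_overlap(X, 2.0 * X, k=5)

# Unrelated random 2D points: the overlap is some value in [0, 1].
overlap_random = neighbor_overlap(X, rng.normal(size=(30, 2)), k=5)
```

In a real test, the second argument would be the output of `fit_transform`, and the 0.6 threshold mentioned above would be applied to the returned mean.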

david-cortes-intel · 2025-01-08T08:22:23Z

I think, given the characteristics of the data that you are passing, it could be done by selecting a hard-coded set of points by index from "Complex Dataset1" that should end up being similar, and another set of points that should end up being dissimilar to the first. The test would then check that the Euclidean distances in the embedding space among the points of the first set are smaller than the distances between each point in the first set and each point in the second set.

Also maybe "Complex Dataset2" is not needed.
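
A minimal sketch of this kind of check, assuming the embedding is already available as a NumPy array. The function name `groups_are_separated` and the toy coordinates are hypothetical and are not taken from the PR; a real test would pass the TSNE output and the hard-coded indices instead.

```python
import numpy as np


def groups_are_separated(embedding, group_a, group_b):
    """True if every point in A is closer to all of A than to any point in B."""
    emb = np.asarray(embedding, dtype=float)
    a = emb[group_a]
    b = emb[group_b]
    # Distances among points of group A, shape (len(A), len(A)).
    within = np.linalg.norm(a[:, None, :] - a[None, :, :], axis=-1)
    # Distances from each point in A to each point in B, shape (len(A), len(B)).
    across = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # The zero self-distances on the diagonal of `within` never change its row maxima.
    return bool(np.all(within.max(axis=1) < across.min(axis=1)))


# Toy embedding: indices 0-2 form one tight cluster, 3-5 another far away.
toy_embedding = np.array(
    [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [10.0, 10.0], [10.1, 10.0], [10.0, 10.1]]
)

separated = groups_are_separated(toy_embedding, [0, 1, 2], [3, 4, 5])
mixed = groups_are_separated(toy_embedding, [0, 1, 3], [2, 4, 5])
```

Comparing the largest within-group distance against the smallest cross-group distance for each point is equivalent to looping over every (A, A, B) triple, but reports a single boolean.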

yuejiaointel · 2025-01-08T19:26:57Z

Hi David!
I fixed the logic based on your suggestion; here is my understanding. First, pick a group A of similar points and a group B of points that differ from group A. Then check that, in the embedding space, the distance between any two points in group A is less than the distance from that point to any point in group B. I ran the CI many times, and one problem with this approach is that it sometimes fails on GPU devices: in those cases the embedding did not keep close points close. This only occurs in pipeline runs; it runs without problems on my local machine. I am not sure whether I should create another ticket to investigate that. I also removed complex test 2.
Best,
Yue

yuejiaointel · 2025-01-09T04:34:53Z

/intelci: run

david-cortes-intel · 2025-01-09T08:05:26Z

sklearnex/manifold/tests/test_tsne.py

+    assert_allclose(tsne_1, tsne_2, rtol=1e-5)
+
+
+def compute_pairwise_distances(data):


Better to use the implementation in this same package:
https://github.com/uxlfoundation/scikit-learn-intelex/blob/main/sklearnex/metrics/pairwise.py

Or failing that, in the dependencies, like these:
https://scikit-learn.org/dev/modules/generated/sklearn.metrics.pairwise_distances.html
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html
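
For illustration, the SciPy option linked above could replace a hand-written helper roughly like this. The input array is made up for the example; `sklearn.metrics.pairwise_distances(X)` should give the same matrix for the default Euclidean metric.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Three collinear points chosen so the distances are easy to verify by hand.
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Full (n, n) matrix of Euclidean distances between the rows of X.
dist = cdist(X, X, metric="euclidean")
```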

david-cortes-intel · 2025-01-09T08:07:47Z

sklearnex/manifold/tests/test_tsne.py

+
+    # Ensure close points in original space remain close in embedding
+    group_a_indices = [0, 1, 2]  # Hardcoded index of similar points
+    group_b_indices = [3, 4, 5]  # Hardcoded index of dissimilar points from a


How about using the points that were already available and had also differences in signatures - e.g.

[2e9, 2e-9, -2e9, -2e-9]

david-cortes-intel · 2025-01-09T08:08:11Z

sklearnex/manifold/tests/test_tsne.py

+@pytest.mark.parametrize(
+    "X,n_components,perplexity,expected_shape",
+    [
+        pytest.param(


Since there's only one parameterization, maybe these could be defined inside the function body.

david-cortes-intel · 2025-01-09T08:09:19Z

sklearnex/manifold/tests/test_tsne.py

+    dtype,
+):
+    """
+    TSNE test covering multiple functionality and edge cases using parameterization.


I think this comment is redundant given the function name.

david-cortes-intel · 2025-01-09T08:10:44Z

sklearnex/manifold/tests/test_tsne.py

+    # Check for distance b/t two points in group A < distance of this point and any point in group B
+    for i in group_a_indices:
+        for j in group_a_indices:
+            if i != j:


This condition should not be needed.

david-cortes-intel · 2025-01-09T08:12:29Z

sklearnex/manifold/tests/test_tsne.py

+            0.5,
+            (10, 2),
+            False,
+            id="Extremely low perplexity",


Considering what this is testing, perhaps this test could also check that the embeddings contain no infinities, no NaNs, and are not all zeros.
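
A rough sketch of such a check, assuming the embedding has already been converted to a NumPy array. The helper name `embedding_is_sane` is hypothetical and is not part of the PR.

```python
import numpy as np


def embedding_is_sane(embedding):
    """True if the embedding has no NaN/inf values and is not all zeros."""
    emb = np.asarray(embedding, dtype=float)
    all_finite = bool(np.all(np.isfinite(emb)))  # rejects both NaN and +/-inf
    not_all_zero = bool(np.any(emb != 0))
    return all_finite and not_all_zero


good = embedding_is_sane([[0.5, -1.2], [3.0, 0.0]])
has_nan = embedding_is_sane([[0.5, np.nan], [3.0, 0.0]])
has_inf = embedding_is_sane([[0.5, np.inf], [3.0, 0.0]])
all_zero = embedding_is_sane(np.zeros((10, 2)))
```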

david-cortes-intel · 2025-01-09T08:13:12Z

sklearnex/manifold/tests/test_tsne.py

+    X_df = _convert_to_dataframe(X, sycl_queue=queue, target_df=dataframe)
+    tsne = TSNE(n_components=2, perplexity=2.0).fit(X_df)
+    assert "daal4py" in tsne.__module__
+    assert hasattr(tsne, "n_components"), "TSNE missing 'n_components' attribute."


This one should already get tested as part of the line that comes next.

david-cortes-intel · 2025-01-09T08:18:10Z

sklearnex/manifold/tests/test_tsne.py

+    tsne = TSNE(n_components=2, perplexity=2.0).fit(X_df)
+    assert "daal4py" in tsne.__module__
+    assert hasattr(tsne, "n_components"), "TSNE missing 'n_components' attribute."
+    assert tsne.n_components == 2, "TSNE 'n_components' attribute is incorrect."


Could also check for 'perplexity' as it was passed to the constructor too.

feature: new tests added for tsne to expand test coverage

c686edd

yuejiaointel requested review from Alexsandruss, samir-nasibli and icfaust as code owners December 17, 2024 18:18

ethanglaser marked this pull request as draft December 17, 2024 19:02

test: additional test for gpu and golden data embedding test for tsne

f3f5223

icfaust reviewed Dec 18, 2024

yue.jiao added 3 commits December 18, 2024 08:10

fix: fix format by running black and isort test_tsne.py

10da764

fix: const test check shape instead of str output

2f3e9fa

fix: test removing raise error test

739a90c

yuejiaointel marked this pull request as ready for review December 19, 2024 00:00

ethanglaser requested review from Vika-F and ethanglaser December 19, 2024 06:53

david-cortes-intel reviewed Dec 19, 2024

yue.jiao added 2 commits December 19, 2024 08:37

fix: fix test based on comments

822e614

fix: parametize basic test, use rng for ramdom datasets for independent results, merge previous deleted gpu test to complex test

c6bf0bd

david-cortes-intel reviewed Jan 7, 2025

david-cortes-intel reviewed Jan 7, 2025

fix: additional tests for complex and sparse data, use pytest param for parametrization names, removed extra tests

5d2da20

fix: fix the logic to ensure tsne can keep close point close in embedding space

e95f5a3

yue.jiao added 2 commits January 8, 2025 13:14

fix: logic test amke group a and b more different

44f3c14

fix: print and more differetn for input on group a and group b for logical test

cba1ce9

yue.jiao added 6 commits January 8, 2025 16:00

fix: add check to check for dpctl array check and convert it to numpy array

ba7658e

fix: format fix for tsne tests

dc04722

fix: use _as_numpy to convert to numpy obj

9791ea4

fix: fix tsne format

11f5edc

test: investigate on ci why gpu test is not getting correct result for logical test

8c1dc28

test-ci: don't comment other tests

1fbc7f0

fici-testsee changes with smaller preplexity

a57cd08

yuejiaointel requested review from napetrov, homksei and ahuber21 as code owners January 9, 2025 05:23

fix: remove print

28f9815

david-cortes-intel reviewed Jan 9, 2025
