feature: new tests added for tsne to expand test coverage #2229
base: main
Codecov Report: All modified and coverable lines are covered by tests ✅
/intelci: run
/intelci: run
…nt results, merge previous deleted gpu test to complex test
/intelci: run
It looks like neither this file nor daal4py has a test checking that the results from TSNE make sense beyond having the right shape and no missing values. Since the last test uses a very particular dataset, it would be helpful to add further assertions there, for example checking that the embeddings place some points closer together than others, as would be expected given the input data.
…or parametrization names, removed extra tests
Hi David,
I think, given the characteristics of the data that you are passing, it could be done by selecting a hard-coded set of points by index from "Complex Dataset1" that should end up similar, and a second set of points that should end up dissimilar to the first; the test would then check that the Euclidean distances in the embedding space among the points of the first set are smaller than the distances between each point in the first set and each point in the second set. Also, "Complex Dataset2" may not be needed.
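A minimal sketch of the check described above, using only NumPy. The helper name `groups_are_separated` and the toy embedding are hypothetical and stand in for the real TSNE output; the actual indices would have to be chosen from the dataset used in the test.

```python
import numpy as np


def groups_are_separated(embedding, group_a, group_b):
    """Return True if, for every point in group A, its largest distance to
    another point in group A is smaller than its smallest distance to any
    point in group B (Euclidean distance in the embedding space)."""
    a = embedding[group_a]
    b = embedding[group_b]
    # Pairwise distances via broadcasting: shapes (len(a), len(a)) and (len(a), len(b)).
    within_a = np.linalg.norm(a[:, None, :] - a[None, :, :], axis=-1)
    a_to_b = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return bool(np.all(within_a.max(axis=1) < a_to_b.min(axis=1)))


# Hypothetical 2-D embedding: rows 0-2 form one cluster, rows 3-5 another.
embedding = np.array(
    [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
)
separated = groups_are_separated(embedding, [0, 1, 2], [3, 4, 5])
```

Because the distance from a point to itself is zero, the diagonal of `within_a` never affects the comparison, so no explicit `i != j` guard is needed in this formulation.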
Hi David!
/intelci: run
assert_allclose(tsne_1, tsne_2, rtol=1e-5)


def compute_pairwise_distances(data):
Better to use the implementation in this same package:
https://github.com/uxlfoundation/scikit-learn-intelex/blob/main/sklearnex/metrics/pairwise.py
Or failing that, in the dependencies, like these:
https://scikit-learn.org/dev/modules/generated/sklearn.metrics.pairwise_distances.html
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html
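As an illustration of the suggestion, here is a sketch that gets the pairwise distance matrix from `scipy.spatial.distance.cdist` instead of a hand-written helper. The sample points are made up, and the body of the PR's own `compute_pairwise_distances` is not shown in this thread, so this is not a drop-in replacement for it.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Made-up points chosen so the distances are easy to verify by hand (3-4-5 triangles).
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Full matrix of Euclidean distances between every pair of rows.
distances = cdist(points, points, metric="euclidean")
```

`sklearn.metrics.pairwise_distances(points)` returns the same matrix with its default Euclidean metric, which may be preferable if SciPy should not be imported directly in the test module.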
# Ensure close points in original space remain close in embedding
group_a_indices = [0, 1, 2]  # Hardcoded indices of similar points
group_b_indices = [3, 4, 5]  # Hardcoded indices of points dissimilar from group A
How about using the points that were already available and that also had differences in signatures, e.g.
[2e9, 2e-9, -2e9, -2e-9]
@pytest.mark.parametrize(
    "X,n_components,perplexity,expected_shape",
    [
        pytest.param(
Since there's only one parameterization, maybe these could be defined inside the function body.
    dtype,
):
    """
    TSNE test covering multiple functionality and edge cases using parameterization.
I think this comment is redundant given the function name.
# Check that the distance between two points in group A is smaller than the distance from this point to any point in group B
for i in group_a_indices:
    for j in group_a_indices:
        if i != j:
This condition should not be needed,
0.5,
(10, 2),
False,
id="Extremely low perplexity",
Considering what this is testing, perhaps this test could also check that the embeddings contain no infinities, no NaNs, and are not all zeros.
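One way such a check could look, as a sketch: the helper name `assert_embedding_is_sane` is hypothetical, and in the real test it would be applied to the array returned by `fit_transform` (or to `embedding_`) after converting it to NumPy.

```python
import numpy as np


def assert_embedding_is_sane(embedding):
    """Fail if the embedding has NaN or infinite entries, or is entirely zero."""
    embedding = np.asarray(embedding)
    # np.isfinite is False for both NaN and +/-inf, so one check covers both.
    assert np.all(np.isfinite(embedding)), "Embedding contains NaN or infinite values."
    assert np.any(embedding != 0), "Embedding is all zeros."


# Example with a made-up, well-behaved embedding; this call passes silently.
assert_embedding_is_sane(np.array([[0.5, -1.0], [2.0, 0.0]]))
```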
X_df = _convert_to_dataframe(X, sycl_queue=queue, target_df=dataframe)
tsne = TSNE(n_components=2, perplexity=2.0).fit(X_df)
assert "daal4py" in tsne.__module__
assert hasattr(tsne, "n_components"), "TSNE missing 'n_components' attribute."
This one should already get tested as part of the line that comes next.
tsne = TSNE(n_components=2, perplexity=2.0).fit(X_df)
assert "daal4py" in tsne.__module__
assert hasattr(tsne, "n_components"), "TSNE missing 'n_components' attribute."
assert tsne.n_components == 2, "TSNE 'n_components' attribute is incorrect."
Could also check for 'perplexity' as it was passed to the constructor too.
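A sketch of the combined attribute checks. It uses stock scikit-learn's `TSNE` as a stand-in for the patched sklearnex estimator (an assumption made so the snippet runs without sklearnex installed) and skips the `fit` call, since both values are stored by the constructor.

```python
from sklearn.manifold import TSNE

# Stand-in for the estimator under test; the PR uses the sklearnex version.
tsne = TSNE(n_components=2, perplexity=2.0)

# Comparing the value already fails if the attribute is missing,
# so a separate hasattr check is not required.
assert tsne.n_components == 2, "TSNE 'n_components' attribute is incorrect."
assert tsne.perplexity == 2.0, "TSNE 'perplexity' attribute is incorrect."
```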
Description
Added tests in sklearnex/manifold/tests/test_tsne.py to expand the test coverage for the t-SNE algorithm.
PR completeness and readability
Testing