Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] inspectable set-valued domains for distributions #292

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

VascoSch92
Copy link

Reference Issues/PRs

First version fro implementation of #244

What does this implement/fix? Explain your changes.

The PR implements a first version for set-valued domains for distributions.

What should a reviewer concentrate their feedback on?

Did you add any tests for the change?

There are no tests for the changes. This is a point in the PR checklist. For instance, this is just a. draft PR to discuss a first approach distributions domains.

Any other comments?

As said, this is a draft PR to start a discussion about domains for distributions.

For instance, the most important domains for distributions that skpro needs are

  • interval of the real line (class Interval)
  • finite sets of the real line (class Finite)
  • direct product of the above domains (class Product)

Each of these class implements the methods:

  • __contains__ which can be used to elegantly check if an element is in the domain of an interval
  • boundary which can be used for integration

I found no other method that can be useful, for now. Perhaps, a couple of example could clarify the situation.

About Infinite sets of point. They can be of two kinds:

  • discrete
  • continuous (if they have an accumulation point, e.g., $ \{ \frac{1}{n} : n \in \mathbb{N} \}$

Which of them do we need? There are examples of distributions which needs these sets?

Finally, I did not implement tags, because of two reasons:

  • I'm not 100% sure how tags works (the are tags of subclasses override tags of parent classes or are added?)
  • I didn't know which tags could be useful in this context.

PR checklist

  • I've added myself to the list of contributors with any new badges I've earned :-)
    How to: add yourself to the all-contributors file in the skpro root directory (not the CONTRIBUTORS.md). Common badges: code - fixing a bug, or adding code logic. doc - writing or improving documentation or docstrings. bug - reporting or diagnosing a bug (get this plus code if you also fixed the bug in the PR).maintenance - CI, test framework, release.
    See here for full badge reference
  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.
  • Implement tests for the new classes
  • Document the various classes
  • Add _tags

skpro/domains/domains.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot!

I have now had time to review and think about this design.
I think it's great, of course still work in progress.

I want to add three things to think about - these are not change requests but thinking points.

  1. How do we use this from BaseDistribution? I was thinking about two attributes - domain and support. The support is the locus of non-zero mass/density, domain is the full range. For now, domain is perhaps redundant, but it will get useful for categorical rv.
    • a technical point, how would this look like? I would use a property and return a clone each time.
  2. an open question that I have no good answer for - how do we deal with the a dataframe/array case of distribution? Suppose we have the Uniform distribution, it has lower and upper interval. But this can be different for each index. So what is support in this case? Some kind of array set? You have Product, but I was wondering whether there should be a more direct support of the array case - building it into the core API rather than by composition, like with BaseDistribution.
  3. I was also thinking about the Lebesgue decomposition, i.e., continuous part plus discrete part. We need to express both domains, somehow, and make them queriable.
  4. finally, this is more of a crazy idea: should we decorate all functions with their range and domain? You could think of getattr(my_distr, "pdf") returning not a method but an object, like loc and iloc in pandas or BaseDistribution. This could have attributes, so doing sth like my_dist.pdf.domain and my_dist.pdf.range will return a set object. What do you think of it? Maybe it is overengineered, I would not do it right away anyway.

@VascoSch92
Copy link
Author

How do we use this from BaseDistribution? I was thinking about two attributes - domain and support. The support is the locus of non-zero mass/density, domain is the full range. For now, domain is perhaps redundant, but it will get useful for categorical rv.
a technical point, how would this look like? I would use a property and return a clone each time.

I think using domain and support is a good idea, and cloning each time is clean.

an open question that I have no good answer for - how do we deal with the a dataframe/array case of distribution? Suppose we have the Uniform distribution, it has lower and upper interval. But this can be different for each index. So what is support in this case? Some kind of array set? You have Product, but I was wondering whether there should be a more direct support of the array case - building it into the core API rather than by composition, like with BaseDistribution.

I don't have an answer for that. What can I suggest is that we start to introducing the domain of a distribution for easy distribution (Normal for example) and then we progress with more challenging one. In this way we can understand what we have to change to serve at best our final scope.

finally, this is more of a crazy idea: should we decorate all functions with their range and domain? You could think of getattr(my_distr, "pdf") returning not a method but an object, like loc and iloc in pandas or BaseDistribution. This could have attributes, so doing sth like my_dist.pdf.domain and my_dist.pdf.range will return a set object. What do you think of it? Maybe it is overengineered, I would not do it right away anyway.

It is an interesting Idea. I'm thinking, probability density function, logarithmic probability function and cumulative density function have aways the same domain right? Therefore, this decorator it is not helpful to define domains for function associate to distributions (perhaps I'm wrong and there is some case where it can be used for that). On the other side, this could be a nice API for the final user :-)

Should I finalise the class and then we can work on a branch where we add the domain/support for distribution already coded? What do you think?

@fkiraly
Copy link
Collaborator

fkiraly commented May 8, 2024

I don't have an answer for that. What can I suggest is that we start to introducing the domain of a distribution for easy distribution (Normal for example) and then we progress with more challenging one. In this way we can understand what we have to change to serve at best our final scope.

Hm, ok. I suppose that works, incremental design is better than overdesign.

It is an interesting Idea. I'm thinking, probability density function, logarithmic probability function and cumulative density function have aways the same domain right?

For the full domain, yes (the reals), but supports for the discrete part would be most interesting here, since that is not directly inspectable right now. I agree though that starting simple would be best.

Should I finalise the class and then we can work on a branch where we add the domain/support for distribution already coded? What do you think?

Agreed, let's see whether this works or whether we hit limitations.

@VascoSch92
Copy link
Author

Hey @fkiraly

I worked a little on the PR request and I think it is ready for a review.

Let me know what do you think.

Some remarks:

  • the way I coded the unit tests can do better.
  • what do you think about merging the module not in main but in another branch, such that we can experimenting? Once we have a stable API, we can merge in main.
  • For instance, just finite, interval and product sets are implemented.

@VascoSch92 VascoSch92 marked this pull request as ready for review May 13, 2024 15:05
@fkiraly
Copy link
Collaborator

fkiraly commented May 14, 2024

Yes, I agree with developing this off main to a point where the functionality is integrated with the BaseDistribution class sufficiently.

I would suggest, let's make a branch on sktime called sets, I will base it off this branch, and we continue making PR to that branch?

Also, it would be nice to have a short write-up of your plans for the next steps, perhaps in the issue #244. I'm happy to halp with that or set up a meeting.

@fkiraly
Copy link
Collaborator

fkiraly commented May 14, 2024

done, there is now a branch called sets on skpro.

@VascoSch92
Copy link
Author

Perfect!

Should I close this PR and checkout the new branch?

I will take sometime to write in #244 what are the plans for the future, such that we can discuss ;-)

@fkiraly
Copy link
Collaborator

fkiraly commented May 15, 2024

Should I close this PR and checkout the new branch?

I would say, let's leave this branch and PR as is, as it "freezes" the current state. Perhaps turn it into a draft.
The new PR will point to a branch that will change, so even if they are now identical they are not redundant in function.

Further, I think the tests are failing due to lack of get_test_params - perhaps you want to fix that to ensure BaseObject API compliance before we move on.

@VascoSch92
Copy link
Author

Should I close this PR and checkout the new branch?

I would say, let's leave this branch and PR as is, as it "freezes" the current state. Perhaps turn it into a draft. The new PR will point to a branch that will change, so even if they are now identical they are not redundant in function.

Further, I think the tests are failing due to lack of get_test_params - perhaps you want to fix that to ensure BaseObject API compliance before we move on.

ok perfect. This PR is draft now and I will open another PR into the other branch to fix the tests.

The tests were passing locally. I should give a look on how you implement the tests in the other modules.

@VascoSch92 VascoSch92 marked this pull request as draft May 15, 2024 18:06
@fkiraly
Copy link
Collaborator

fkiraly commented May 15, 2024

The tests were passing locally. I should give a look on how you implement the tests in the other modules.

What were you trying for this? Have you tried check_estimator on the new objects?

@VascoSch92
Copy link
Author

What were you trying for this? Have you tried check_estimator on the new objects?

I implemented just unit tests. I was just running them using pytest directly. Should I use the machinery developed for the BaseObject?

@fkiraly
Copy link
Collaborator

fkiraly commented May 15, 2024

Should I use the machinery developed for the BaseObject?

It's automatically applied by the CI, and you can mimick that by using check_estimator on the objects. For the specific classes, you'd need implemented get_test_params to pass - or, defaults.

@VascoSch92
Copy link
Author

Should I use the machinery developed for the BaseObject?

It's automatically applied by the CI, and you can mimick that by using check_estimator on the objects. For the specific classes, you'd need implemented get_test_params to pass - or, defaults.

Sorry i really don't get 100% how it works. Do you have an example? I was trying to understand the tests for distribuitions but with not so much success.

@fkiraly
Copy link
Collaborator

fkiraly commented May 16, 2024

Sure! here is some explanation on the test classes:

https://www.sktime.net/en/latest/developer_guide/testing_framework.html#an-illustrative-example

Instances of objects are constructed from the parameter settings returned in get_test_params. If the latter contains no settings, and there are non-default params in __init__, the framework will not know how to construct an instance.

@VascoSch92
Copy link
Author

Is it mandatory to use this topology of testing if we are extending from BaseObject ?

@fkiraly
Copy link
Collaborator

fkiraly commented May 16, 2024

Is it mandatory to use this topology of testing if we are extending from BaseObject?

Yes, in the sense that BaseObject descendants should be BaseObject API compliant, otherwise you lose the composability etc.
The requirement weight from this is typically minimal, such as specifying test parameters and not overriding methods.

@VascoSch92
Copy link
Author

Is it mandatory to use this topology of testing if we are extending from BaseObject?

Yes, in the sense that BaseObject descendants should be BaseObject API compliant, otherwise you lose the composability etc.

The requirement weight from this is typically minimal, such as specifying test parameters and not overriding methods.

Last question: there is a way to trigger the tests in the exact same way they are triggered in the workflow? Such that i can check that everything works before opening a Pr

@fkiraly
Copy link
Collaborator

fkiraly commented May 19, 2024

Last question: there is a way to trigger the tests in the exact same way they are triggered in the workflow? Such that i can check that everything works before opening a Pr

Yes, running pytest on your local repository with no soft dependencies installed should trigger run-tests-no-extras.

Installing skpro[all_extras] and then triggering the tests should have the same effect as run-tests-all-extras.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants