Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiplatform #209

Open
wants to merge 29 commits into
base: master
Choose a base branch
from
Open

Multiplatform #209

wants to merge 29 commits into from

Conversation

McDjuady
Copy link

@McDjuady McDjuady commented Oct 19, 2022

I've been working over the last few months to create a multiplatform version for skrape.it
It's somewhat in line with #196 but I'm a bit further along in some areas, which is why i wanted to get this out.
So far I've converted the buildscripts to multiplatform and implemented some modules in JS.
This is still pretty much WIP and I intend to keep working on it. I'll update the pull request as I get further along and improve the code

What's done so far:

  • Converted buildscripts to multiplatform
  • Added and implemented JS-Targets for the following modules:
    • :dsl
    • :fechter:base-fetcher
    • :html-parser
    • :test-utils
  • Converted the multiplatform modules to use the robstoll/atrium test framework

What still needs to be done:

  • Convert the rest of the modules
  • Decide what to do with the different fetchers. Are they really necessary?
  • Fixup the kover reports (Should be pretty much the same as Initial Kotlin Multiplatform setup #196)
  • Cleanup the code and document it
  • Migrate the JS Target to the new IR compiler (waiting on atrium for that)

Other notable changes:

  • Kotlin version was bumped to 1.7.10 and a few other dependecies were updated
  • Disabled build caching as well as RepositoriesMode.PREFER_SETTINGS since those can unfortunately mess up the builds in multiplatform

McDjuady added 11 commits June 29, 2022 13:41
Some tasks are still not entirely the same and will be updated later
A few tests still need fixing up caused by formatting issues
Using JSDOM for emulation. Since this option is rather slow, it will only be used when jsExecution is requested
In the browser the default implementation via DOMParser will be used
In node linkedom will be used, since it's implementation is more resource friendly than JSDOM and perfectly suitable for web scrapping

Remaining issues are formatting related and will require a custom formatter to match the output of Jsoup
All tests are now passing on the JS side
# Conflicts:
#	assertions/build.gradle.kts
#	build.gradle.kts
#	buildSrc/src/main/kotlin/Deps.kt
#	dsl/build.gradle.kts
#	examples/android/app/build.gradle
#	examples/scraping/build.gradle.kts
#	examples/use-pre-release-version/build.gradle.kts
#	fetcher/async-fetcher/build.gradle.kts
#	fetcher/async-fetcher/src/jvmMain/kotlin/it/skrape/fetcher/AsyncFetcher.kt
#	fetcher/base-fetcher/build.gradle.kts
#	fetcher/browser-fetcher/build.gradle.kts
#	fetcher/browser-fetcher/src/jvmMain/kotlin/it/skrape/fetcher/logging/LoggingConfigurator.kt
#	fetcher/browser-fetcher/src/jvmMain/resources/META-INF/services/ch.qos.logback.classic.spi.Configurator
#	fetcher/browser-fetcher/src/jvmMain/resources/logback.xml
#	fetcher/http-fetcher/build.gradle.kts
#	fetcher/http-fetcher/src/jvmMain/kotlin/it/skrape/fetcher/HttpFetcher.kt
#	gradle.properties
#	html-parser/build.gradle.kts
#	html-parser/src/commonMain/kotlin/it/skrape/selects/DomTreeElement.kt
#	integrationtests/build.gradle.kts
#	ktor-extension/build.gradle.kts
#	mock-mvc-extension/build.gradle.kts
#	settings.gradle.kts
#	test-utils/build.gradle.kts
Upgraded language version and apiTarget to 1.7
Introduced karma middleware to allow test containers to be spun up from the browser
Reworked wiremock stubs to use JSON api and ktor client for multiplatform compatibility
Temporarily allow TestContainers to be run on Windows for testing (and while we figure out how to block them on certain platforms)
NodeJS is now working using testcontainers directly
Block testcontainers from being bundled with webpack
Improved robustness of karma testcontainer middleware
Expose ROOT_PROJECT_PATH to NodeJS tests
…ultiplatform

� Conflicts:
�	fetcher/browser-fetcher/build.gradle.kts
@christian-draeger
Copy link
Collaborator

Wow this is great news 🤩
Really looking forward to this.

Let me know if there are any questions that can maybe help to understand internals of the Lib if needed.

I personally have no experience so far with creating multiplattform Libs in kotlin. In my daily life I am very much into using kotlin on the jvm/server-side.
But I think skrape{it} would be very benificial in the kotlin js world 🙂

I am really glad if skrape{it} could become a multiplatform Lib 🥳

@McDjuady
Copy link
Author

My main question right now would be if multiple fetchers are really necessary.
To keep things simple I'd like to use ktor as a client for any fetch operations. However this implies, that fetching can only be done via a suspend function on non JVM platforms, since JVM is the only target to support runBlocking. This would remove HttpFetcher since it would become redundant.
JsExecution is a feature of http-parser, which falls back onto BrowserFetcher on JVM, in JS it's rendered directly using JSDOM. It would make sense to move the functionality for JVM to http-praser as well, which would deprecate BrowserFetcher, leaving only AsyncFetcher, which could then be migrated into Base Fetcher to leave only one fetcher implementation.
My Plan would be to create on AsyncFetcher supporting jsExecution and with JVM specific extension functions providing blocking calls if needed.

I geuss the question is if that is okay with you and if that's a direction you feel comfortable with?

@christian-draeger
Copy link
Collaborator

Hey, sorry for late response.

My main question right now would be if multiple fetchers are really necessary.

the different fetchers (beside BrowserFetcher which has a little more special purpose) only exists because of convenience for the users to either run blocking or non-blocking.

To keep things simple I'd like to use ktor as a client for any fetch operations. However this implies, that fetching can only be done via a suspend function on non JVM platforms, since JVM is the only target to support runBlocking. This would remove HttpFetcher since it would become redundant.

if this could be simplified and as you mentioned "fetching can only be done via a suspend function on non JVM platforms" anyway i would apprechiate simplification (reducing it to 1) of the fetchers a lot as long as we provide an intuitive way to still make blocking fetches on the JVM possible. either by at least documenting that wrapping calls in a runBlocking scope is necessary or preferred provide something like a skrapeBlocking method to the DSL.

just a quick idea somewhat inspired by what kohttp dsl is doing here, but i am open for other ideas as well:
Synchronous calls: https://kohttp.gitbook.io/docs/core/synchronous-calls/get
Asynchronous calls: https://kohttp.gitbook.io/docs/core/asynchronous-calls/async-get

To keep things simple I'd like to use ktor as a client for any fetch operations

i am totally fine to switch to ktor client completely as a build-in/default as long if we still provide a way for users to implement custom fetchers to keep the lib flexible and not completely dependent on ktor. i am not deep into ktor client, if it is possible to implement custom ktor clients anyway that would be good enough.

JsExecution is a feature of http-parser, which falls back onto BrowserFetcher on JVM, in JS it's rendered directly using JSDOM. It would make sense to move the functionality for JVM to http-praser as well, which would deprecate BrowserFetcher, leaving only AsyncFetcher, which could then be migrated into Base Fetcher to leave only one fetcher implementation.

That sounds resonable to me and would again simplify the code base a lot =) i like!

Last but not least the resaon why Browserfetcher is an own fetcher but also part of the html-parser

Browserfetcher uses html-unit to act as a JVM based headless browser in order to be able to render pages that include JS correctly. it makes the actual request using html-unit as well as rendering the response in this embedded headless browser. That in turn means that if you would fetch a page that includes conditional rendering on some really browser specific things as request headers like authorization, set-cookie or user-agent etc just works as you would expect using a "real" browser.
BrowserFetcher.render() that is currently used in the html-parser on the other hand just renders a html string without request information available.
calling WebClient.withOptions(request: Request) extension function on the com.gargoylesoftware.htmlunit.WebClient inside of BrowserFetchers .render() function could probably fix align the behaviour of BrowserFetcher.render and BrowserFetcher.renderparse.
But since skrape{it} supports parsing html from string you don't have a request when just parsing a html string.
If that can be reworked somehow to maybe add the request options to html-parsers jsExecution mode if available that should be fine.
on he other hand i have to admit that it is only there for a super specific use-case that is maybe not worth to maintain.
from that point of view i would be fine with a simplistic solution that drops BrowserFetcher and let html-parser render html with or without js execution :) If we could still support adding request data to that rendering this would be nice but not a must. Is the described point JVM related only or also a deal with JSDOM?

I geuss the question is if that is okay with you and if that's a direction you feel comfortable with?

overall absolutely =)

@McDjuady
Copy link
Author

Thanks for clearing all that up!
I think most of what you are describing should be possible using just the Ktor client. The hardest part will probably be to get both Ktor and HtmlUnit to use the same client for requesting any additional resources that might come up when running JS on a website (I'll probably write some wrapper classes so HtmlUnit uses the ktor client).
As for JSDOM it's pretty similar. With a few wrappers it'll probably be possible to have a single client to make all the requests.

It might even be possible to replace Skrape.it's own Request class with the Ktor Request builder, since they both serve the same purpose.

Could you maybe provide some tests in which the BrowserFetcher specific features are used so I can make sure it still fulfills those requirements?

The fix is more of a hack that redirects any https request to the http url. Since this is only used for testing and ssl validation is always relaxed it should be fine
@codecov
Copy link

codecov bot commented Oct 23, 2022

Codecov Report

Merging #209 (53ccb08) into master (956e21d) will decrease coverage by 83.18%.
The diff coverage is 0.00%.

❗ Current head 53ccb08 differs from pull request most recent head 90ae18d. Consider uploading reports for the commit 90ae18d to get more accurate results

@@            Coverage Diff             @@
##           master    #209       +/-   ##
==========================================
- Coverage   83.18%   0.00%   -83.17%     
==========================================
  Files          39      40        +1     
  Lines        1141    1173       +32     
  Branches      173     145       -28     
==========================================
- Hits          949       0      -949     
- Misses        131    1173     +1042     
+ Partials       61       0       -61     
Impacted Files Coverage Δ
...commonMain/kotlin/it/skrape/matchers/Assertions.kt 0.00% <ø> (ø)
...in/it/skrape/matchers/ContentTypeHeaderMatchers.kt 0.00% <0.00%> (ø)
...c/commonMain/kotlin/it/skrape/matchers/Matchers.kt 0.00% <ø> (ø)
...onMain/kotlin/it/skrape/matchers/StatusMatchers.kt 0.00% <0.00%> (ø)
...c/jvmMain/kotlin/it/skrape/fetcher/AsyncFetcher.kt 0.00% <0.00%> (ø)
...src/jvmMain/kotlin/it/skrape/fetcher/extensions.kt 0.00% <0.00%> (ø)
...monMain/kotlin/it/skrape/fetcher/Authentication.kt 0.00% <0.00%> (ø)
.../src/commonMain/kotlin/it/skrape/fetcher/Cookie.kt 0.00% <0.00%> (ø)
...src/commonMain/kotlin/it/skrape/fetcher/Request.kt 0.00% <0.00%> (ø)
.../src/commonMain/kotlin/it/skrape/fetcher/Result.kt 0.00% <ø> (ø)
... and 66 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@christian-draeger
Copy link
Collaborator

christian-draeger commented Oct 25, 2022

Hey,
Again, sorry for the late reply.
We are currently working towards a deadline at my job and this means that I unfortunately don't have as much time for open source as I would like to have 🥲

But anyway I took a week off in mid November and will try to provide some tests with the browserfetcher specific features :)

With a few wrappers it'll probably be possible to have a single client to make all the requests.

That sounds absolutely fantastic 🙂

It might even be possible to replace Skrape.it's own Request class with the Ktor Request builder, since they both serve the same purpose.

Tbh I like the look and feel of skrape{it}s request dsl more than ktors. But we could map skrape{it} request dsl to ktor request object internally

Test coverage for JVM works again
JS test coverage is not included and the report seems to not work anyway
For now it only affects downgrading when following redirects. sslRelaxed might not be possible on a per-request basis and may need to be a property of the scraper itself
Deprecated old Scraper methods. Still need to be mapped to the new KtorScraper
JVM works as before except a proxy setting for the request. Will need further testing if it's possible on a per-request basis
JS is pretty difficult, especially in the browser. Some features like disabling redirects and proxies won't be supported.
Replaced Atrium with Kotest-Assertions in preparation for a move to the Kotest Framework for more fine-control over test execution
Switched JS builds to use the IR-Compiler
Updated kotlin version to 1.7.21
Modified test to match new Request style
Switched to Kotest
Improved resources for multiplatform tests
Fixed out of spec behaviour in css selector generation where attributes where enclosed in quotes
Some tests had to be disabled on JS for causing issues
JS Web will require proxy support to be truly usable
@McDjuady
Copy link
Author

McDjuady commented Jan 7, 2023

Hi!

So I've been working along on the multiplatform/JS implementation of skrape.it. It's fully functional on JS now and I'd say I'm quite close to completion. The major missing feature right now is Proxy support on JS.

Because of some of the (breaking) changes I've made this would probably be better suited for a major realease. Here is a (non inclusive) list of what I've changed since the last time:

Breaking Changes

  • Removed multiple fetchers. One single fetcher using Ktor is provided. For compatibility purposes I've kept BrowserFetcher,AsyncFetcher etc. as a configuration variable meaning you can simply fix the imports and it works the same way as it did before
  • Changed Request to use ktors own HttpRequest builder. Extension functions and variables are provided to make them as close as possible to the old request. However, means url is now no longer an assignment but rather a function call and proxy is (not yet) supported. Also the HttpMethod now also uses the Ktor provided implementation

Notable Changes

  • JS is now supported and working
  • Testing framework was changed to Kotest to allow for multipaltform support and better control of test execution. This allows for test to be written in the common code and not be executed on certain platforms or only if docker is present
  • JSExecution can now be controlled on a per request basis and the capability was moved to html-parser

Known issues and limitations

  • Proxy support is missing
  • JS does not support sslRelaxed yet. This should be possible on node, in the web however a proxy will be needed
  • Web will require a proxy to be useful since most scraping atempts will otherwise be denied by CORS
  • Manually handling redirects is not possible when using browser fetch.

TODOs

  • Proxy support
  • Documentation
  • Potentially use more of the built in Types from Ktor (ContentType, Response, etc) instead of wrapping or converting them

If you have the time I'd be very happy if you could look the changes over and give me some feedback if you are happy with the direction this is going.
Also there seems to be an issue with the skrape.it documentation site, where it's impossible to fetch it from JS. Cloudflare just returns 500 with an error page. You can try this yourself with postman or simply try to fetch the page in node. Maybe you could look into that since there might be some information in the server logs as to why it's happening.

Cheers
Max

@christian-draeger
Copy link
Collaborator

christian-draeger commented Jan 12, 2023

hey @McDjuady this is such awesome news. a really really big thank you for all the your effort so far. you rock!

first of all don't worry because of breaking changes. we will release a new major version :)
if smooth migration is possible thats good, if not also ok as long as we write good docs.

i will have a look at the code at latest until monday. job is much more chilled again currently^^

PS: skrape.it docs are hosted via gitbook.com, can not see or do much over there since its SaaS. side-note: i thought about having a better documentation site. i played around with kotlin-playground) in the past. the idea was to have kind of an interactive docu like the official kotlin docs where you can try out code snippets directly. i wanted to work on that again soon, therefore i wouldn't like to invest to much time in the old docs setup / want only add documentation there.

@christian-draeger
Copy link
Collaborator

again sorry for late response. i just reviewed what you did so far. looks good to me 👍

@russellbanks
Copy link

Any update on this?

@christian-draeger
Copy link
Collaborator

Hey @McDjuady,
I know it's been a while. But still I want to kindly ask if you would be open to finish the PR? 🙂

@Tgo1014
Copy link

Tgo1014 commented Oct 18, 2024

Is this dead? 😢

@christian-draeger
Copy link
Collaborator

christian-draeger commented Nov 21, 2024

at least for this PR - it looks like.
i personally running out of time for maintenance due to job and family time.
nevertheless i will try to find some days over the x-mas days to update the projekt again.
maybe, i will give a try for multiplatform again depending on time.

any help help or pull requests welcome

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants