To start crawling and extracting data from sites, you need to create a data extractor that is capable of handling a given resource and plug it into the crawler engine. Then you can read the incoming data from the `dataBus`.

This extractor reads the `h1` title from example.com and pushes it into the `dataBus` as 'website header'.

```js
const Extractor = require('crawler.newride.tech/Extractor');
const Nightmare = require('nightmare');

class MyExtractor extends Extractor {
  // Declare which URLs this extractor is able to handle.
  canCrawlUrl(url) {
    return Promise.resolve(/example\.com\/index\.html/.test(url));
  }

  // Scrape the h1 heading and publish it on the dataBus.
  extractFromUrl(urlListDuplexStream, dataBus, url) {
    return new Nightmare()
      .goto(url)
      .evaluate(() => document.querySelector('h1').textContent)
      .end()
      .then(textContent => dataBus.pushData('website header', textContent));
  }
}

module.exports = MyExtractor;
```

To work, the crawler needs a `UrlListDuplexStream`, a `DataBus` and an `ExtractorSet`.

These are the basic objects required for the mechanism to work, and each of them plays its own role in the crawling process. Splitting the logic across several classes keeps their concerns separated, so each class can operate independently; only combined together do they make a Crawler that is able to extract data from various sources efficiently.

`Crawler` - orchestrates and manages the `DataBus`, `ExtractorSet` and `UrlListDuplexStream` so that they work together. It's the only object able to keep everything together and start the crawling process.

`DataBus` - its sole purpose is to transfer data. Since all crawling happens asynchronously, you can never know when the desired data arrives; the `DataBus` will notify you at the exact moment any extractor finds the information you are waiting for.

By design, the `DataBus` does not care how the data is obtained or when it is published. Thanks to that, it does not enforce any particular implementation on Extractors.

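For instance, you only subscribe to a key on the `dataBus` and handle whatever arrives, without knowing which extractor produced it or when. A minimal sketch reusing the calls shown elsewhere in this README; the 'page title' key is made up here, and calling `pushData` outside an extractor is an assumption:

```js
const DataBus = require('crawler.newride.tech/EventEmitter/DataBus');

const dataBus = new DataBus();

// The listener only cares about the key, not about which extractor
// published the data or how that extractor obtained it.
dataBus.addListener('page title', datagram => {
  console.log('got data:', datagram.data);
  datagram.resolve(); // same acknowledgement call as in the full example below
});

// Any extractor (or, presumably, any other code) can publish under that key.
dataBus.pushData('page title', 'Example Domain');
```
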
`ExtractorSet` - each `Extractor` must specify what kind of resource it can handle. The `ExtractorSet` keeps all the Extractors together and, when asked, points to the Extractors that can handle the given resource (a website, for example).

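As a rough sketch, a set simply holds every extractor you want the crawler to consider; each extractor claims a kind of page through its `canCrawlUrl()` check. The `AboutPageExtractor` below is a made-up example, not part of the library:

```js
const Extractor = require('crawler.newride.tech/Extractor');
const ExtractorSet = require('crawler.newride.tech/ExtractorSet');
const MyExtractor = require('./MyExtractor');

// Hypothetical second extractor, shown only to illustrate that a set can
// hold several extractors, each declaring what it handles via canCrawlUrl().
class AboutPageExtractor extends Extractor {
  canCrawlUrl(url) {
    return Promise.resolve(/example\.com\/about\.html/.test(url));
  }

  extractFromUrl(urlListDuplexStream, dataBus, url) {
    // ...extract something and push it onto the dataBus, as in MyExtractor
    return Promise.resolve();
  }
}

// When the crawler hits a URL, it asks the set which extractors claim it.
const extractorSet = new ExtractorSet([
  new MyExtractor(),
  new AboutPageExtractor()
]);
```
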
`UrlListDuplexStream` - keeps track of the links that are going to be crawled. If you start crawling with an empty `UrlListDuplexStream`, the `Crawler` will stop immediately since it has nothing to do.

The `UrlListDuplexStream` is a FIFO queue that feeds links to the `Crawler` until it is empty. You can push more links into the `UrlListDuplexStream` while crawling; thanks to that, you can crawl a list of resources using one extractor and then push the links to the individual resources into the `UrlListDuplexStream` so other Extractors can crawl them (see the sketch below).

Crawling stops when the `UrlListDuplexStream` is empty.

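A sketch of that pattern: the hypothetical `ListExtractor` below crawls an index page and queues the links it finds for the other extractors. It pushes them via `feed()`, the same method used to seed the stream, which is an assumption on my part:

```js
const Extractor = require('crawler.newride.tech/Extractor');
const Nightmare = require('nightmare');

// Hypothetical extractor that only reads a listing page and queues the
// individual resource links it finds, so other extractors can handle them.
class ListExtractor extends Extractor {
  canCrawlUrl(url) {
    return Promise.resolve(/example\.com\/list\.html/.test(url));
  }

  extractFromUrl(urlListDuplexStream, dataBus, url) {
    return new Nightmare()
      .goto(url)
      .evaluate(() => Array.from(document.querySelectorAll('a'), a => a.href))
      .end()
      .then(links => urlListDuplexStream.feed(links)); // assumes feed() also works mid-crawl
  }
}

module.exports = ListExtractor;
```

With all the pieces in place, the full setup looks like this:
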
```js
const Crawler = require('crawler.newride.tech/Crawler');
const DataBus = require('crawler.newride.tech/EventEmitter/DataBus');
const ExtractorSet = require('crawler.newride.tech/ExtractorSet');
const UrlListDuplexStream = require('crawler.newride.tech/UrlListDuplexStream');

// your extractor
const MyExtractor = require('./MyExtractor');

const dataBus = new DataBus();
dataBus.addListener('website header', datagram => {
  console.log(datagram.data); // 'Example Domain'
  datagram.resolve();
});

const crawler = new Crawler(dataBus, new ExtractorSet([
  new MyExtractor()
]));

const urlListDuplexStream = new UrlListDuplexStream();
urlListDuplexStream.feed([
  'http://example.com'
]);

crawler.run(urlListDuplexStream);
```