- Create a Python 3 virtual environment.
  python3 -m venv myvenv
  source myvenv/bin/activate
- Install the required dependencies.
  pip install -r requirements.txt
- Create a PostgreSQL database and set it up with the provided script.
  psql -U username -d myDataBase -a -f db_setting.sql
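  To confirm the setup worked, you can optionally list the tables it created (standard psql usage):
  psql -U username -d myDataBase -c "\dt"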
- Edit the PostgreSQL connection information in the web_crawler.conf file.
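  The exact option names depend on the project's conf format, so the following is only a sketch of the kind of connection settings web_crawler.conf is expected to hold, not the actual keys:
  host = localhost
  port = 5432
  dbname = myDataBase
  user = username
  password = yourpassword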
This program requires 3 arguments.
- component name: url_queue / web_downloader / parser / sql_extractor / table_extractor / nl_extractor
- input file type: sql / list
- input file name
1. Component name
   - url_queue: It executes crawling using Google Search. Its input file should contain search keywords (see below).
   - web_downloader: It executes crawling with the outputs of the url_queue.
   - parser: It parses the downloaded HTML and filters it using content keywords.
   - sql_extractor: It extracts SQL from the output of the parser and filters out HTML that does not contain any SQL. It follows the syntax of SQLite3.
   - table_extractor: It extracts tables from the output of the sql_extractor.
   - nl_extractor: It extracts tables from the HTML that was not filtered.
2. Input file type
   - sql: you can use a SQL query as the input (e.g. select id from url_info;).
   - list: you can use a list as the input (e.g. a file containing: 0 1 3).
3. Input file name
   The file contains the list or the SQL query to use as input (see the example below).
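For example, the following run (the file name is illustrative) would feed the sql_extractor component a list-type input file:
python web_crawler.py sql_extractor list url_ids.list
Here url_ids.list would simply contain the ids to process, e.g. 0 1 3.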
- url_queue
  It needs keywords.
  Example for sql:
  select content from topic_keywords;
  Example for file:
  SQL select SQL+where+from
- web_downloader, parser, sql_extractor, table_extractor, nl_extractor
  They need URL ids.
  Example for sql:
  select id from url_ids;
  Example for file:
  0 1 3
You can test with the run.sh script or with the following commands:
- python web_crawler.py url_queue sql topic_keywords.sql
- python web_crawler.py url_queue list topic_keywords.list
- python web_crawler.py web_downloader sql url_ids.sql
- python web_crawler.py web_downloader list url_ids.list
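The remaining components presumably follow the same pattern (the file names here are illustrative, not files shipped with the repository):
- python web_crawler.py parser list url_ids.list
- python web_crawler.py sql_extractor list url_ids.list
- python web_crawler.py table_extractor list url_ids.list
- python web_crawler.py nl_extractor list url_ids.list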