# Web-Crawler-for-NL2SQL

## Environment

  1. Create a Python 3 virtual environment.

    python3 -m venv myvenv
    source myvenv/bin/activate

  2. Install the required dependencies.

    pip install -r requirements.txt

  3. Create a PostgreSQL database and run the database setup script.

    psql -U username -d myDataBase -a -f db_setting.sql

  4. Edit the PostgreSQL connection information in the web_crawler.conf file (a sketch of what such a file might look like follows this list).
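
The exact format of web_crawler.conf is not documented here, so the following is only a minimal sketch, assuming an INI-style file; every key and value below is illustrative, not a confirmed schema:

    [postgresql]
    host = localhost
    port = 5432
    user = username
    password = secret
    dbname = myDataBase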

## How to run

This program requires three arguments (the general invocation form is shown after the list).

  1. component name: url_queue / web_downloader / parser / sql_extractor / table_extractor / nl_extractor
  2. input file type: sql / list
  3. input file name
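
Putting the three arguments together, every run takes the same general form (this matches the concrete commands in the Test section below):

    python web_crawler.py <component_name> <input_file_type> <input_file_name>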

### 1. Component name

  1. url_queue: Crawls Google Search results for the given keywords; the input file supplies the keywords.
  2. web_downloader: Downloads the pages for the URLs produced by url_queue.
  3. parser: Parses the downloaded HTML and filters pages by content keywords.
  4. sql_extractor: Extracts SQL statements from the parser output and filters out pages that contain no SQL. SQL is recognized according to sqlite3 syntax.
  5. table_extractor: Extracts tables from the output of sql_extractor.
  6. nl_extractor: Extracts natural language from the HTML pages that were not filtered out. (A sketch of the full pipeline follows this list.)
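
Taken together, the components form a pipeline over the crawled pages; a rough sketch of the data flow implied by the descriptions above (exactly which stage feeds nl_extractor is not spelled out):

    url_queue -> web_downloader -> parser -> sql_extractor -> table_extractor
                                                           -> nl_extractor (unfiltered pages)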

### 2. Input file type

  1. sql: the input file contains a SQL query (e.g. select id from url_info;).
  2. list: the input file contains a list of values, one per line or separated by whitespace (e.g. 0 1 3).

### 3. Input file name

The named file contains either a SQL query or a list, depending on the input file type.

  1. url_queue

    It needs keywords.

    • Example for sql:

        select content from topic_keywords;
      
    • Example for file:

        SQL
        select
        SQL+where+from
      
  2. web_downloader, parser, sql_extractor, table_extractor, nl_extractor

    It needs URL ids.

    • Example for sql:

        select id from url_ids;
      
    • Example for file:

        0
        1
        3
      

## Test

You can test with the run.sh script or with the following commands (a sketch for the remaining components follows the list):

  1. python web_crawler.py url_queue sql topic_keywords.sql
  2. python web_crawler.py url_queue list topic_keywords.list
  3. python web_crawler.py web_downloader sql url_ids.sql
  4. python web_crawler.py web_downloader list url_ids.list
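
The remaining components can be driven the same way; as a sketch, with input file names mirroring the url_ids examples above (illustrative, not files shipped with the repo):

    python web_crawler.py parser list url_ids.list
    python web_crawler.py sql_extractor list url_ids.list
    python web_crawler.py table_extractor list url_ids.list
    python web_crawler.py nl_extractor list url_ids.list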