Updates the instructions on how to create spiders (#149) #156

Closed
wants to merge 5 commits into from
140 changes: 89 additions & 51 deletions README.en.md
@@ -4,10 +4,11 @@

![pytest@docker](https://github.com/turicas/covid19-br/workflows/pytest@docker/badge.svg) ![goodtables](https://github.com/turicas/covid19-br/workflows/goodtables/badge.svg)

This repository unifies links and data from the reports on case numbers
published by the State Health Secretariats (Secretarias Estaduais de Saúde -
SES) about COVID-19 cases in Brazil (per city, daily), along with other data
relevant for analysis, such as death tolls recorded by the notary service (per
state, daily).

## Table of Contents
1. [License and Quotations](#license-and-quotations)
@@ -20,49 +21,53 @@ for analysis, such as deaths tolls accounted for in the notary service (by state

## License and Quotations

The code is licensed under
[LGPL3](https://www.gnu.org/licenses/lgpl-3.0.en.html) and the converted data
under [Creative Commons Attribution
ShareAlike](https://creativecommons.org/licenses/by-sa/4.0/). If you use the
data, **mention the original data source and who treated the data**, and if
you share the data, **use the same license**. Examples of how the data can be
cited:
- **Source: Secretarias de Saúde das Unidades Federativas, data treated by
  Álvaro Justen and a team of volunteers from [Brasil.IO](https://brasil.io/)**
- **Brasil.IO: daily epidemiological reports of COVID-19 by city, available
  at: https://brasil.io/dataset/covid19/ (last checked on: XX of XX, XXXX;
  accessed on XX XX, XXXX).**


## Data

Once collected and treated, the data is available in 3 ways on [Brasil.IO](https://brasil.io/):

- [Web Interface](https://brasil.io/dataset/covid19) (made for humans)
- [API](https://brasil.io/api/dataset/covid19) (made for humans who develop
  apps) - [see the available API documentation](api.md) and the access sketch
  just below this list
- [Full dataset download](https://data.brasil.io/dataset/covid19/_meta/list.html)
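
As a rough illustration of programmatic access (not a substitute for the [API
documentation](api.md)), the sketch below queries the dataset using only the
Python standard library. The `caso/data` endpoint, the query parameters and the
commented-out authentication header are assumptions; check api.md for the
actual contract before relying on them.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint and filters -- confirm the real ones in api.md.
BASE_URL = "https://brasil.io/api/dataset/covid19/caso/data"
params = {"state": "RJ", "place_type": "city"}

url = f"{BASE_URL}?{urllib.parse.urlencode(params)}"
request = urllib.request.Request(url, headers={"User-Agent": "covid19-br-example"})
# If the API requires a token, it would be added like this (assumption):
# request.add_header("Authorization", "Token <your-api-key>")

with urllib.request.urlopen(request) as response:
    payload = json.load(response)

# "results" and the field names are assumptions about the response envelope,
# mirroring the dataset fields described in this README.
for row in payload.get("results", []):
    print(row["city"], row["confirmed"], row["deaths"])
```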

If you want to access the data before it is published (ATTENTION: it may not
have been checked yet), you can [access the spreadsheets we are working on
directly](https://drive.google.com/open?id=1l3tiwrGEcJEV3gxX0yP-VMRNaE1MLfS2).

If this program and/or the resulting data are useful to you or your company,
**consider [donating to the Brasil.IO project](https://brasil.io/doe)**, which
is maintained voluntarily.


### FAQ ABOUT THE DATA

**Before contacting us with questions about the data (we're quite busy),
[CHECK OUR FAQ](faq.md)** (still in Portuguese).

For more information [see the data collection
methodology](https://drive.google.com/open?id=1escumcbjS8inzAKvuXOQocMcQ8ZCqbyHU5X5hFrPpn4).

### Clipping

Want to see which projects and news stories are using our data? [See the clipping](clipping.md).

### Analyzing the data

If you want to analyze our data using SQL, look at the script
[`analysis.sh`](analysis.sh) (it downloads the CSVs, converts them to an
SQLite database and creates indexes and views that make the job easier) and
the files in the [`sql/`](sql/) folder.
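
If you prefer to explore the resulting database from Python instead of raw
SQL, a minimal sketch with the standard `sqlite3` module is shown below. The
database file name (`covid19.sqlite`) and the table name (`caso`) are
assumptions; adjust them to whatever `analysis.sh` actually produces, and the
column names are assumptions based on the published dataset.

```python
import sqlite3

# Assumed output of analysis.sh -- change the file name if yours differs.
connection = sqlite3.connect("covid19.sqlite")
connection.row_factory = sqlite3.Row  # access columns by name

# Assumed table/column names, mirroring the published dataset's fields.
query = """
    SELECT state, SUM(confirmed) AS confirmed, SUM(deaths) AS deaths
    FROM caso
    WHERE place_type = 'city' AND date = '2020-05-01'  -- pick a reporting date
    GROUP BY state
    ORDER BY confirmed DESC
"""

for row in connection.execute(query):
    print(f"{row['state']}: {row['confirmed']} confirmed, {row['deaths']} deaths")

connection.close()
```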

By default, the script reuses the same files if they have already been
@@ -74,12 +79,12 @@ OpenDataSUS](analises/microdados-vacinacao/README.md) (still in Portuguese).

### Validating the data

The metadata are described following the *Data Package* and *[Table
Schema](https://specs.frictionlessdata.io/table-schema/#language)* standards
of *[Frictionless Data](https://frictionlessdata.io/)*. This means the data
can be validated automatically to detect, for example, whether the values of a
field conform to the defined type, whether a date is valid, whether columns
are missing, or whether there are duplicated rows.
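
The same validation can also be scripted from Python with the *goodtables*
package (the one referenced by the badge at the top of this README). This is a
minimal sketch that assumes a `datapackage.json` descriptor at the repository
root; adjust the path to the descriptor the repository actually ships.

```python
from goodtables import validate

# Validate every resource declared in the Data Package descriptor
# (the descriptor path is an assumption).
report = validate("datapackage.json")

if report["valid"]:
    print("All tables passed validation.")
else:
    for table in report["tables"]:
        for error in table.get("errors", []):
            print(f"{table.get('source')}: {error['message']}")
```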

To verify, activate the Python virtual environment and then type:

@@ -97,13 +102,17 @@ online through [Goodtables.io](http://goodtables.io/).
- [Other relevant datasets](datasets-relevantes.md)
- [Recommendations for the State Secretariats' data release](recomendacoes.md)

The report from the tool *[Good
Tables](https://github.com/frictionlessdata/goodtables-py)* will
indicate if there are any inconsistencies. The validation can also be
done online through [Goodtables.io](http://goodtables.io/).

## Contributing

You can contribute in many ways:

- Building programs (crawlers/scrapers/spiders) to extract data automatically
  ([READ THIS BEFORE](#creating-new-scrapers));
- Collecting links for your state reports;
- Collecting data about cases by city daily;
- Contacting the State Secretariat from your State, suggesting the
@@ -115,27 +124,52 @@ You can contribute in many ways:
In order to volunteer, [follow these steps](CONTRIBUTING.md).

Look for your state [in this repository's
issues](https://github.com/turicas/covid19-br/issues) and let's talk there.

### Creating Scrapers

We're changing the way we upload the data to make the job easier for
volunteers and to make the process more robust and reliable; this will also
make it easier for bots to upload data, so scrapers will help *a lot* in the
process. However, when creating a scraper it is important that you follow a
few rules:

- It's **required** that you create it using `scrapy`;
- **Do not** use `pandas`, `BeautifulSoup` or other unnecessary libraries (the
  Python standard library already has lots of useful modules, `scrapy` with
  XPath already handles most of the scraping, and `rows` is already a
  dependency of this repository);
- Create a file named `web/spiders/spider_xx.py`, where `xx` is the state
  acronym in lower case. Create a new class that inherits from the
  `BaseCovid19Spider` class in `base.py`. The state acronym, as two upper-case
  characters, must be a class attribute, accessed via `self.state`. See the
  examples already implemented and the sketch after this list;
- There must be an easy way to make the scraper collect reports and cases for
  a specific date (but it should be able to identify which dates the data is
  available for and to capture several dates too);
- The data can be read either from per-municipality tallies or from individual
  case microdata. In the latter case, the scraper itself must compute the
  per-municipality totals;
- The `parse` method must call the `self.add_report(date, url)` method, with
  `date` being the report date and `url` the URL of the information source;
- For each municipality in the state, call the `self.add_city_case` method with
the following parameters:
- `city`: name of the municipality
- `confirmed`: integer, number of confirmed cases (or `None`)
- `death`: integer, number of deaths on that day (or `None`)
- Read the state totals from the information source, if available; sum the
  per-municipality numbers *only if that information is not available in the
  original source*. Add the state totals by calling the `self.add_state_case`
  method, whose parameters are the same as for the municipality method, except
  that the `city` parameter is omitted;
- When possible, use automated tests;
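
A minimal sketch of a spider following the rules above. The state (`XX`/`xx`),
the class name, the start URL and the XPath expressions are placeholders, and
the exact import path of `BaseCovid19Spider` may differ; use the spiders
already implemented in `web/spiders/` as the authoritative reference.

```python
import datetime

from .base import BaseCovid19Spider  # adjust the import to match base.py's location


class Covid19XXSpider(BaseCovid19Spider):
    # Placeholder acronym: replace "XX"/"xx" with the real state (e.g. "RJ").
    name = "xx"
    state = "XX"
    start_urls = ["https://saude.xx.gov.br/boletins"]  # placeholder URL

    def parse(self, response):
        # Ideally parse the report date from the page instead of assuming today.
        date = datetime.date.today()
        self.add_report(date=date, url=response.url)

        total_confirmed = total_deaths = 0
        for row in response.xpath("//table//tr[td]"):  # placeholder XPath
            city = row.xpath("./td[1]/text()").get()
            confirmed = int(row.xpath("./td[2]/text()").get() or 0)
            deaths = int(row.xpath("./td[3]/text()").get() or 0)
            self.add_city_case(city=city, confirmed=confirmed, death=deaths)
            total_confirmed += confirmed
            total_deaths += deaths

        # This placeholder source has no "total in state" row, so the cities
        # are summed here; prefer the total published in the original report.
        self.add_state_case(confirmed=total_confirmed, death=total_deaths)
```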

Right now we don't have much time available for reviews, so **please** only
open a pull request with a new scraper's code if you can fulfill the
requirements above.

## Installing

@@ -152,6 +186,10 @@ Requires Python 3 (tested in 3.8.2). To set up your environment:
2. Create a virtualenv (you can use
[venv](https://docs.python.org/pt-br/3/library/venv.html) for this).
3. Install the dependencies: `pip install -r requirements-development.txt`
4. Run the collect script: `./run-spiders.sh`
5. Run the consolidation script: `./run.sh`
6. Run the script that starts the scraping service: `./web.sh`
- The scrapers will be available through a web interface at the URL http://localhost:5000

### Docker setup

@@ -297,9 +335,9 @@ Run the script:
`./deploy.sh`

It will collect the data from the sheets (linked in `data/boletim_url.csv` and
`data/caso_url.csv`), add the data to the repository, compact it, send it to
the server, and execute the dataset update command.

> Note: the script that automatically downloads and converts data must
> be executed separately, with the command `python covid19br/run_spider.py`.
64 changes: 58 additions & 6 deletions README.md
@@ -27,8 +27,12 @@ ShareAlike](https://creativecommons.org/licenses/by-sa/4.0/). If you use the
data, **cite the original data source and who treated the data**, and if you
share the data, **use the same license**.
Examples of how the data can be cited:
- **Source: Secretarias de Saúde das Unidades Federativas, data treated by
  Álvaro Justen and a team of volunteers from [Brasil.IO](https://brasil.io/)**
- **Brasil.IO: daily epidemiological reports of COVID-19 by municipality,
  available at: https://brasil.io/dataset/covid19/ (the last update can be
  checked on the site).**


## Data
@@ -37,7 +41,8 @@ Once collected and checked, the data is available in 3 ways on
[Brasil.IO](https://brasil.io/):

- [Web Interface](https://brasil.io/dataset/covid19) (made for humans)
- [API](https://brasil.io/api/dataset/covid19) (made for humans who develop
  programs) - [see the API documentation](api.md)
- [Full dataset download](https://data.brasil.io/dataset/covid19/_meta/list.html)

If you want to access the data before it is published (ATTENTION: it may be that
Expand Down Expand Up @@ -112,7 +117,8 @@ Você pode ter interesse em ver também:

You can contribute in several ways:

- Building programs (crawlers/scrapers/spiders) to extract the data
  automatically ([READ THIS GUIDE FIRST](#criando-novos-scrapers));
- Collecting links to your state's reports;
- Collecting data on cases by municipality per day;
- Contacting your state's Health Secretariat, suggesting the
@@ -127,6 +133,48 @@ Look for your state [in this repository's
issues](https://github.com/turicas/covid19-br/issues) and let's talk
there.

### Creating Scrapers

We're changing the way we upload the data to make the job easier for
volunteers and to make the process more robust and reliable; this will also
make it easier for bots to upload the data, so scrapers will help *a lot* in
the process. However, when creating a scraper it is important that you follow
a few rules:

- It is **required** to build the scraper using `scrapy`;
- **Do not** use `pandas`, `BeautifulSoup` or other unnecessary libraries (the
  Python standard library already has many useful modules, `scrapy` with XPath
  already handles most of the scraping, and `rows` is already a dependency of
  this repository);
- Create a file `web/spiders/spider_xx.py`, where `xx` is the state acronym in
  lower case. Create a new class that inherits from the `BaseCovid19Spider`
  class in `base.py`. The state acronym, as two upper-case characters, must be
  a class attribute, accessed via `self.state`. See the examples already
  implemented;
- There must be an easy way to make the scraper collect the reports and cases
  for a specific date (but it should be able to identify which dates data is
  available for and to capture several dates as well);
- The data can be read either from per-municipality tallies or from individual
  case microdata. In the latter case, the scraper itself must compute the
  per-municipality totals;
- The `parse` method must call the `self.add_report(date, url)` method, where
  `date` is the report date and `url` is the URL of the information source;
- For each municipality in the state, call the `self.add_city_case` method
  with the following parameters:
  - `city`: name of the municipality
  - `confirmed`: integer, number of confirmed cases (or `None`)
  - `death`: integer, number of deaths on that day (or `None`)
- Read the state totals from the information source, if available; sum the
  per-municipality numbers *only if that information is not available in the
  original source*. Add the state totals by calling the `self.add_state_case`
  method, whose parameters are the same as for the municipality method, except
  that the `city` parameter is omitted;
- When possible, add automated tests (see the sketch after this list).
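
A rough sketch of what such an automated test could look like, using a saved
copy of a real report so the test runs offline. The spider module, the fixture
path and the final assertions are hypothetical; adapt them to however
`BaseCovid19Spider` exposes the records registered by the `add_*` methods.

```python
from pathlib import Path

from scrapy.http import HtmlResponse, Request

from web.spiders.spider_xx import Covid19XXSpider  # hypothetical spider module


def fake_response(file_path, url="https://saude.xx.gov.br/boletins"):
    """Build an offline scrapy response from a saved HTML report."""
    body = Path(file_path).read_bytes()
    return HtmlResponse(url=url, body=body, request=Request(url=url), encoding="utf-8")


def test_spider_parses_saved_report():
    spider = Covid19XXSpider()
    # Consume the result in case parse() is (or becomes) a generator.
    list(spider.parse(fake_response("tests/data/boletim-xx-2020-05-01.html")) or [])

    assert spider.state == "XX"
    # Also assert on the cases registered via add_city_case / add_state_case,
    # according to how the base class stores them.
```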

We don't have much time available for reviews at the moment, so **please**
only open a *pull request* with a new scraper's code if you can fulfill the
requirements above.

## Installing

@@ -141,11 +189,15 @@ You can set up your development environment using the
2. Crie um virtualenv (você pode usar
[venv](https://docs.python.org/pt-br/3/library/venv.html) para isso).
3. Instale as dependências: `pip install -r requirements-development.txt`

4. Rode o script de coleta: `./run-spiders.sh`
5. Rode o script de consolidação: `./run.sh`
6. Rode o script que sobe o serviço de scraping: `./web.sh`
- Os scrapers estarão disponíveis por uma interface web a partir do endereço http://localhost:5000

### Setup com Docker

Se você preferir utilizar o Docker para executar, basta usar os comandos a seguir :
If you prefer to use Docker, just use the following commands:

```shell
make docker-build  # to build the image