An API wrapper around the Newspaper Navigator machine learning pipeline for extracting visual content (such as cartoons and advertisements) from newspapers. The original Newspaper Navigator application can be found here.
This API provides endpoints that accept images. When an image is submitted to the API, it is segmented and visual content (such as cartoons, maps, or advertisements) is extracted. For more information, see the Pipeline section in this README.
- Clone this repo.
- Download the pre-trained model weights from here.
- Place the model weights in the `src/resources/` folder.
- Install [Tesseract](https://github.com/tesseract-ocr/tessdoc/blob/master/Installation.md) and make sure `tesseract` is on your PATH.
- Install PyTorch>=1.7.1.
- Install detectron2.
- Run `pip install -r requirements.txt` to install the required Python packages.
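Before launching, it can help to confirm that the prerequisites above are visible to Python. The following is a minimal sketch; the helper name `check_prerequisites` is hypothetical and not part of this repo.

```python
import importlib.util
import shutil


def check_prerequisites(binary="tesseract", modules=("torch", "detectron2")):
    """Report whether the OCR binary is on PATH and the modules can be located."""
    report = {binary: shutil.which(binary) is not None}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```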
Alternatively, if for whatever reason you cannot install the above, you can build a Docker container that contains all of the dependencies.
- Clone this repo.
- Download the pre-trained model weights from here.
- Place the model weights in the `src/resources/` folder.
- Install [Docker](https://docs.docker.com/get-docker/).
- Run `docker-compose build`.
Please note that images, especially large ones, take a long time to segment on CPU; a response can take a couple of minutes. Furthermore, the segmentation model requires a large amount of RAM. As such, ensure your system meets the following minimum requirements:
1. CPU: At least 2 cores.
2. RAM: At least 8 GB.
If you run into RAM or processing-time issues, lower the `MAX_IMAGE_SIZE` parameter in `config.py` to process images at a lower resolution.
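To get a feel for what a lower `MAX_IMAGE_SIZE` means, here is a rough sketch of aspect-preserving downscaling. It assumes the parameter caps the longer side of the image in pixels; the actual semantics in `config.py` may differ, and `scaled_size` is a hypothetical helper.

```python
def scaled_size(width, height, max_image_size):
    """Return (width, height) shrunk so the longer side is at most max_image_size."""
    longest = max(width, height)
    if longest <= max_image_size:
        return width, height  # already small enough; never upscale
    scale = max_image_size / longest
    # Round to whole pixels, keeping each side at least 1 px.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

Under this assumption, halving the limit roughly quarters the number of pixels the model has to process.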
If running locally, you can launch the API by running `python main.py`.
If running in Docker, you can launch it by running `docker-compose up --build`.
The CLI accepts the following arguments:
1. `--port` / `-p`: The port to launch on (default `8008`).
2. `--host`: The host to listen on (default `0.0.0.0`).
3. `--log-level` / `-l`: Minimum log level to use (default `info`).
4. `--timeout` / `-t`: Keep-alive timeout in seconds (default `30`).
5. `--api_key` / `-k`: The API key to use (default `None`). If not specified, the API starts without authentication.
For example, run `python main.py --port 5000 --api_key "abcdef"` to launch on port 5000 with an API key of "abcdef".
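For reference, the documented flags and defaults could be declared with `argparse` roughly as follows. This is a sketch that mirrors the list above, not the actual contents of `main.py`; the function name `build_parser` is an assumption.

```python
import argparse


def build_parser():
    """Declare the documented CLI flags with their documented defaults."""
    parser = argparse.ArgumentParser(description="Launch the segmentation API.")
    parser.add_argument("--port", "-p", type=int, default=8008)
    parser.add_argument("--host", default="0.0.0.0")
    parser.add_argument("--log-level", "-l", default="info")
    parser.add_argument("--timeout", "-t", type=int, default=30)
    # No key (None) means the API starts without authentication.
    parser.add_argument("--api_key", "-k", default=None)
    return parser
```

Calling `build_parser().parse_args(["--port", "5000", "--api_key", "abcdef"])` yields the same values as the example command above.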
- Set `USE_CPU` in `config.py` to `False`.
- Install the latest NVIDIA drivers.
- Install the CUDA toolkit.
- Make sure your Torch version supports CUDA.
- If using Docker, install nvidia-container-toolkit.
- If using Docker, run `docker-compose -f docker-compose_GPU.yml up` instead of `docker-compose up`.
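One quick way to verify that your Torch build can see the GPU is sketched below. `cuda_available` is a hypothetical helper; it only wraps the standard `torch.cuda.is_available()` call and returns `False` when PyTorch is not installed.

```python
def cuda_available():
    """Return True only if PyTorch is importable and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        return False  # PyTorch is not installed in this environment
    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    print("CUDA available:", cuda_available())
```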
The API has the following endpoints:
- `/api/segment_formdata`: This endpoint expects the image to be segmented to be attached as form data.
- `/api/segment_url`: This endpoint expects a POST request with a JSON body in the format `{"image_url": URL_HERE}`.
- `/api/segment_base64`: This endpoint expects a POST request with a JSON body in the format `{"image_base64": BASE64_HERE}`.
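As an illustration, the two JSON endpoints could be called from Python's standard library along these lines. This is a sketch under assumptions: `BASE_URL` assumes a local server on the default port, the helper names are hypothetical, and the image is sent as plain base64 without any data-URI prefix. How the API key is transmitted is not described here, so it is left out.

```python
import base64
import json
import urllib.request

BASE_URL = "http://localhost:8008"  # assumption: local server on the default port


def json_request(path, payload):
    """Build (but do not send) a POST request carrying a JSON body."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def segment_url_request(image_url):
    """Request for /api/segment_url with the documented body shape."""
    return json_request("/api/segment_url", {"image_url": image_url})


def segment_base64_request(image_bytes):
    """Request for /api/segment_base64; raw bytes are base64-encoded first."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json_request("/api/segment_base64", {"image_base64": encoded})
```

A request built this way can be sent with `urllib.request.urlopen(req)` and the response body decoded with `json.loads`.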
All endpoints return a `SegmentationResponse` in the following format:

    class SegmentationResponse(BaseModel):
        status_code: int
        error_message: str
        segment_count: Optional[int]
        segments: Optional[List[ExtractedSegment]]

where `ExtractedSegment` is defined by:

    class ExtractedSegment(BaseModel):
        ocr_text: str
        hocr: str
        bounding_box: BoundingBox
        embedding: List[float]
        classification: str
        confidence: float
Note: If something goes wrong with your request, the `status_code` of the response will be non-zero and the reason will be returned in `error_message`. In that case, `segment_count` and `segments` will be null, so make sure to check the `status_code` of the response before accessing those fields.
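A client can enforce this check in one place. The sketch below assumes the response body is available as JSON text; `SegmentationError` and `extract_segments` are hypothetical names, not part of the API.

```python
import json


class SegmentationError(Exception):
    """Raised when the API reports a non-zero status_code."""


def extract_segments(response_text):
    """Return the list of segments, or raise if the API reported a failure."""
    body = json.loads(response_text)
    if body["status_code"] != 0:
        # segments and segment_count are null here, so stop before touching them.
        raise SegmentationError(body.get("error_message") or "unknown error")
    return body["segments"] or []
```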
The API uses pytest for testing. Tests are located in `/src/tests`.
You can run the tests by navigating to `src` and calling `pytest` in your terminal.
Note that this requires the model weights to have been downloaded already.
Images go through a pipeline that can be broken into the following steps:
- The image is given to a pretrained Faster R-CNN-based model that returns bounding boxes, classifications, and confidences for visual content.
- All results below a configurable minimum confidence threshold are discarded.
- The segments are cropped out from the original image.
- Each segment goes through OCR using Tesseract.
- Each segment goes through a pretrained ResNet-18 model to generate image embeddings (useful for similarity comparison and search by image).
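The thresholding and cropping steps above can be sketched in plain Python. This is an illustration only: the `Detection` type, the `(x1, y1, x2, y2)` pixel box format, and the list-of-rows image representation are assumptions, and the real pipeline operates on model outputs and image arrays.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    box: tuple            # (x1, y1, x2, y2) in pixels; this format is an assumption
    classification: str
    confidence: float


def filter_detections(detections, min_confidence):
    """Discard detections whose confidence is below the threshold."""
    return [d for d in detections if d.confidence >= min_confidence]


def crop(image_rows, box):
    """Cut a box out of an image stored as a list of pixel rows."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image_rows[y1:y2]]
```

Each cropped segment would then be passed on to OCR and to the embedding model.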
The segmentation model is trained to classify the following visual content:
- Illustration
- Map
- Comics/Cartoons
- Editorial Cartoon
- Headline
- Advertisement
Since the segmentation model takes a long time to process images on CPU, it can be cumbersome to test when integrating it with another application. To make testing easier, the API also has the following three endpoints:
- `/test/segment_formdata`
- `/test/segment_url`
- `/test/segment_base64`
These endpoints mirror their `/api/` counterparts: they expect the same input formats and return a `SegmentationResponse`. However, they always return the same response and do so very quickly. When testing your application, you can use them to get well-formed responses quickly instead of waiting for images to process.
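In client code, switching between the real and the test endpoints can be reduced to a single flag. The helper below is a hypothetical sketch, not part of the API.

```python
ENDPOINT_NAMES = ("segment_formdata", "segment_url", "segment_base64")


def endpoint(name, use_test_endpoints=False):
    """Return the path for an endpoint, using /test/ instead of /api/ on request."""
    if name not in ENDPOINT_NAMES:
        raise ValueError(f"unknown endpoint: {name}")
    prefix = "/test/" if use_test_endpoints else "/api/"
    return prefix + name
```

Because both families accept the same input and return the same response type, the rest of the client can stay unchanged.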
The API uses Tesseract in page segmentation mode 12 to perform OCR. Both the plain-text OCR output and the location-aware hOCR HTML are included in the segmentation response.
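To reproduce comparable OCR output outside the API, Tesseract can be invoked from the command line with `--psm 12` (sparse text with orientation and script detection). The sketch below assumes a Tesseract 4 or later binary on your PATH; how the API itself calls Tesseract is not shown here, and the helper names are hypothetical.

```python
import subprocess


def tesseract_command(image_path, hocr=False):
    """Build a Tesseract CLI call that prints results to stdout in PSM 12."""
    cmd = ["tesseract", image_path, "stdout", "--psm", "12"]
    if hocr:
        cmd.append("hocr")  # built-in config that switches the output to hOCR
    return cmd


def run_ocr(image_path, hocr=False):
    """Run Tesseract and return its output as text; raises if the call fails."""
    result = subprocess.run(
        tesseract_command(image_path, hocr),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```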
- When submitting an image by URL for segmentation, the API must download the image from that URL. Depending on the server that the image is hosted on, it may reject automated attempts to fetch the image. The following download headers are used to alleviate that in some cases:
    FILE_DOWNLOAD_HEADERS = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Apple'
                      'WebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/74.0.3729.157 Safari/537.36'}
However, keep in mind that some image URLs will still be rejected.
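If you want to check in advance whether a given URL is likely to be fetchable, you can try downloading it yourself with the same headers. The sketch below uses only the standard library; the API may use a different HTTP client internally, and the function names are hypothetical.

```python
import urllib.request

FILE_DOWNLOAD_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Apple'
                  'WebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/74.0.3729.157 Safari/537.36'}


def build_download_request(image_url):
    """Attach the browser-like headers to a GET request for the image."""
    return urllib.request.Request(image_url, headers=FILE_DOWNLOAD_HEADERS)


def download_image(image_url, timeout=30):
    """Fetch the image bytes; raises urllib.error.URLError (or a subclass) on failure."""
    with urllib.request.urlopen(build_download_request(image_url), timeout=timeout) as resp:
        return resp.read()
```

If the download fails here, submitting the image through the form data or base64 endpoint avoids the server-side download altogether.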