Vitor Baptista - Code analysis: How the Diario Oficial project extracts data from gazettes' PDFs

The Diário Oficial is a project by the Serenata de Amor Operation to extract the government purchases that had bidding exemption (because they are below a certain value) published in the official gazettes. Their intent is to get this data in a machine readable format, so we can look for suspicious purchases. In this post, we'll walk through the code to understand how the gazette is parsed and its bidding exemptions are extracted.

Ther code is available on GitHub. The application is split in two main areas: web and processing. The web, written in NodeJS, is responsible for the API and website to visualize the data. The processing component deals with scraping the gazettes, extracting their data, and saving it to the database. We'll focus on the processing component.

The processing component uses Celery to run its tasks periodically. The tasks are defined in diario-oficial/processing/tasks.py. It defines two periodic tasks: parse_sections and run_spiders. The run_spiders runs every day at 13:00 UTC, finding the gazette's PDF and saving its text into the database (using PdfParsingPipeline). The parse_sections then go over these records, extracting the data from their text.

In this post, we'll look into the parse_sections task, walking through the process of extracting the data of a gazette from Porto Alegre.

The initial state

We're looking into how the parse_sections task behaves when there's already a gazette PDF's text in the database (extracted by the run_spiders task). We'll use the gazette 4.955 from Porto Alegre in 2/March/2015, extracting its bidding exemptions. This is how they look like in the source PDF:

Page in Porto Alegre's gazette with bidding exemptions

We also have it in textual form extracted from this PDF. In the end, the data in each bidding exemption (i.e. "dispensa de licitação") section will be parsed and saved into our database.

Step 1: Extract the sections with bidding exemptions

The parse_sections task is defined as:

@app.task
def parse_sections():
    # Instantiate object that will update the Gazette model
    row_update = RowUpdate(Gazette)
    # Run the parsing using SectionParsing and update the model
    row_update(SectionParsing)
    # Schedule parsing of the bidding exemptions' text
    parse_bidding_exemptions.delay()

The RowUpdate class abstracts updating rows in the database. It's instantiated by passing a model (Gazette in this case), and then called with an "executor" class, SectionParsing. Let's take a look:

class SectionParsing:
    def __init__(self, session):
        self.session = session

    def condition(self):
        return 'is_parsed = FALSE'

    def update(self, gazettes):
        for gazette in gazettes:
            territory = PARSABLE_TERRITORIES.get(gazette.territory_id)
            if territory:
                parsing_cls = getattr(locations, territory)
                parser = parsing_cls(gazette.source_text)
                self.update_bidding_exemptions(gazette, parser)
                gazette.is_parsed = True

    def update_bidding_exemptions(self, gazette, parser):
        parsed_exemptions = parser.bidding_exemptions()
        if parsed_exemptions:
            for record in gazette.bidding_exemptions:
                self.session.delete(record)
            for attributes in parsed_exemptions:
                record = BiddingExemption(**attributes)
                record.date = gazette.date
                gazette.bidding_exemptions.append(record)

This class manages filtering the gazettes to be updated, replacing their existing bidding exemptions (if any) with the ones just parsed, and then marking the gazette by setting is_parsed = True, so it won't be parsed again.

The only gazettes that will be parsed are the ones that:

Weren't parsed yet (i.e. is_parsed == False), and;
We have a parser for its territory_id (defined in the PARSABLE_TERRITORIES dictionary)

The parsers themselves are defined in diario-oficial/processing/gazette/locations. Let's check the RsPortoAlegre parser:

class RsPortoAlegre(BaseParser):
    def bidding_exemptions(self):
        items = []
        for section in self.bidding_exemption_sections():
            items.append(
                {'data': self.bidding_exemption(section), 'source_text': section}
            )
        return items

    # other methods omitted for brevity...

This class goes over the PDF's text, looking for the bidding exemption sections, and returns a list with their data as:

{
    'data': {
        'CONTRATANTE': 'Município de Porto Alegre.',
        'CONTRATADO': 'Classul Indústria e Comércio de Placas e Brindes Ltda.',
        'OBJETO': 'Confecção de 50 medalhas Cidade de Porto Alegre.',
        'VALOR': 'R$ 5.535,00.',
        'DOTAÇÃO ORÇAMENTÁRIA': '201-2524-339031050000-1',
        'BASE LEGAL': 'Artigo 24, inciso II, da Lei Federal 8.666/93.',
    },
    'source_text': '...'  # The original text
}

This is saved into the gazettes.bidding_exemptions attribute.

Step 2: Parse the bidding exemptions

Notice that all attributes in the the bidding exemptions' data are strings. For example, the "VALOR" attribute is "R$ 5.535,00.", instead of a number 5535. In this step, we'll clean and parse these values into their specific data types via the parse_bidding_exemptions task.

@app.task
def parse_bidding_exemptions():
    row_update = RowUpdate(BiddingExemption)
    row_update(BiddingExemptionParsing)

Here we have RowUpdate, as in the parse_sections task, but this time we're updating the BiddingExemption model using the BiddingExemptionParsing class. Let's see how it looks like:

class BiddingExemptionParsing:
    def condition(self):
        return 'is_parsed = FALSE'

    def update(self, records):
        for record in records:
            territory = PARSABLE_TERRITORIES.get(record.gazette.territory_id)
            if territory:
                self.update_object(record)
                self.update_value(record)
                self.update_contracted(record)
                self.update_contracted_code(record)
                record.is_parsed = True

    # other methods omitted for brevity...

It is similar to the SectionParsing we saw in the last step. It also uses the condition as is_parsed = False, but instead of getting the gazettes, it gets each of their bidding exemptions. It loops over each of the exemptions, and if they are from a territory we have a parser for (e.g. Porto Alegre), it'll parse its data.

The parsing is straightforward. For example, the update_value() method simply turns values like "R$ 5.312,94" into the number 5312.94.

Notice that, unlike the SectionParsing class, the parsing is implemented directly in BiddingExemptionParsing. This means that the same code is used by all territories (currently only Goiânia and Porto Alegre). This code will probably need to change in the future, as more and more territories are added, each with their own differences. In the meantime, it's a good example of not adding complexity before you actually need to.

After this code finishes, the database will contain the data properly cleaned and parsed in their respective data types (e.g. numbers instead of strings), which can then be displayed via the web interface and API.