Does this tool support extraction of data from complex PDF structure which contains incomplete boxes? #62

ZarvisD · 2024-10-22T12:57:45Z

I am trying to create a YAML config for this file. Below is the YAML structure I have created. I want to get the Full Name and the reference code. I am able to get the reference code but not the Full Name.

# Use the pdfbox parser, since it's the same one we used to originally extract the text to build this planning document.
extractor: "pdf.pdfbox"

# All measurements are in points. 1 point = 1/72 of an inch.
# x-coordinates are from the left edge of the page.
# y-coordinates are from the top edge of the page.
header:
    # ignore anything less than this many points from the top, default and per-page
  default: 690
footer:
    # ignore anything less than this many points from the bottom, default and per-page
  default: 7160

# Text segments are generally parsed in order, top to bottom, left to right.
# If two text segments have y-coordinates within this many points, consider them on the same line,
# and process the one further left first, even if it is 0.4pt lower on the page.
maxRowDistance: 4

# Define the output data record.
# Since the main record type we're collecting information on is our employees,
# we'll have that be the root type for our harvested information.
rootRecordType: RAF
recordTypes:
  RAF:
    label: "RAF" # Labels are used when nested recordTypes come into play, like this document.
    valueTypes:
      # Not sure what to name a valueType? Just make something up!
      - URC
      - Name

valueTypes:
  URC:
    # In the CSV, use "Unique Reference Code" as the column header instead of "URC".
    label: "Unique Reference Code"
  Name:
    label: "Full Name"

# Now we define the finite-state machine
# Let's name the state that our machine starts off with:
initialState: "INIT"

# When each text segment is encountered, each transition for the current state is checked.
states:
  INIT:
    include: false
    transitions:
      - condition: URC
        nextState: URC

      - condition: any
        nextState: INIT

  URC:
    startRecord: true
    transitions:
      - condition: any  
        nextState: Name

  Name:
    include: true
    transitions:
      - condition: Name
        nextState: Name

      - condition: any
        nextState: INIT


# Here we define the conditions:
conditions:

  # An example of comparing text with regex.
  # In this case, we're making sure that the text contains a standalone run of exactly 32 lowercase hexadecimal characters.
  URC: 'text =~ /\b[a-f0-9]{32}\b/'

  Name: 'text =~ /^[A-Z][a-z]+(?: [A-Z][a-z]+)* [A-Z][a-z]+$/'

  # Need a condition that is always true? "1=1" does that for you.
  any: "1 = 1"
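
One way to narrow this down is to try the two regex conditions above against sample strings outside the tool. The following is a minimal Python sketch, not part of the tool itself: the helper name `matching_conditions` and all sample strings are made up for illustration, and it assumes that `=~` behaves like a "find anywhere in the text" search, which may not match the tool's actual semantics.

```python
import re

# Same patterns as the URC and Name conditions in the config above.
URC_RE = re.compile(r"\b[a-f0-9]{32}\b")
NAME_RE = re.compile(r"^[A-Z][a-z]+(?: [A-Z][a-z]+)* [A-Z][a-z]+$")


def matching_conditions(text):
    """Return the names of the conditions whose regex finds a match in text.

    Hypothetical helper: re.search stands in for the tool's `=~` operator,
    which is an assumption about how the tool evaluates conditions.
    """
    names = []
    if URC_RE.search(text):
        names.append("URC")
    if NAME_RE.search(text):
        names.append("Name")
    return names


# Hypothetical text segments; replace these with segments from the real PDF.
samples = [
    "0123456789abcdef0123456789abcdef",  # 32 lowercase hex characters
    "0123456789ABCDEF0123456789ABCDEF",  # uppercase hex
    "Jane Doe",                          # two capitalized words
    "JANE DOE",                          # all caps
    "Jane A. Doe",                       # middle initial with a period
    "Full Name: Jane Doe",               # label and value in one segment
]

for sample in samples:
    print(repr(sample), "->", matching_conditions(sample))
```

Under these assumptions, the Name pattern only accepts two or more space-separated words that each start with one capital letter followed by lowercase letters. If the PDF stores the name in all caps, with initials, or in the same text segment as its label, the Name condition would never fire even when the state machine itself is correct.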


wstumbo-mfj · 2024-10-23T15:00:41Z

I'm trying to map the image you supplied onto your yml file, but with some of the labels blurred I'm having a difficult time. Can you supply me with a blank copy of the form? With that, I should be able to get a better understanding of what you're trying to do and what is actually happening. I'll then try to give you some reasonable insight into how to proceed.

wstumbo-mfj self-assigned this Oct 23, 2024
