Does this tool support extraction of data from complex PDF structure which contains incomplete boxes? #62

Open
ZarvisD opened this issue Oct 22, 2024 · 1 comment

ZarvisD commented Oct 22, 2024

image

I am trying to create a YAML configuration for this file. Below is the YAML structure I have created. I want to extract the Full Name and the reference code. I am able to get the reference code, but not the Full Name.

# Use the pdfbox parser, since it's the same one we used to originally extract the text to build this planning document.
extractor: "pdf.pdfbox"

# All measurements are in points. 1 point = 1/72 of an inch.
# x-coordinates are from the left edge of the page.
# y-coordinates are from the top edge of the page.
header:
    # ignore anything less than this many points from the top, default and per-page
  default: 690
footer:
    # ignore anything less than this many points from the bottom, default and per-page
  default: 7160

# Text segments are generally parsed in order, top to bottom, left to right.
# If two text segments have y-coordinates within this many points, consider them on the same line,
# and process the one further left first, even if it is 0.4pt lower on the page.
maxRowDistance: 4

# Define the output data record.
# Since the main record type we're collecting information on is RAF,
# we'll have that be the root type for our harvested information.
rootRecordType: RAF
recordTypes:
  RAF:
    label: "RAF" # Labels are used when nested recordTypes come into play, like this document.
    valueTypes:
      # Not sure what to name a valueType? Just make something up!
      - URC
      - Name

valueTypes:
  URC:
    # In the CSV, use "Unique Reference Code" as the column header instead of "URC".
    label: "Unique Reference Code"
  Name:
    label: "Full Name"

# Now we define the finite-state machine
# Let's name the state that our machine starts off with:
initialState: "INIT"

# When each text segment is encountered, each transition for the current state is checked.
states:
  INIT:
    include: false
    transitions:
      - condition: URC
        nextState: URC

      - condition: any
        nextState: INIT

  URC:
    startRecord: true
    transitions:
      - condition: any  
        nextState: Name

  Name:
    include: true
    transitions:
      - condition: Name
        nextState: Name

      - condition: any
        nextState: INIT


# Here we define the conditions:
conditions:

  # An example of comparing text with regex.
  # In this case, we're making sure that the text contains a standalone run of exactly 32 lowercase hexadecimal characters.
  URC: 'text =~ /\b[a-f0-9]{32}\b/'

  Name: 'text =~ /^[A-Z][a-z]+(?: [A-Z][a-z]+)* [A-Z][a-z]+$/'

  # Need a condition that is always true? "1=1" does that for you.
  any: "1 = 1"
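
The two regex conditions above can be checked in isolation, outside the tool. The following is a minimal sketch in Python, assuming that `=~` performs a "find" style match and that the patterns behave the same way under Python's `re` module as in the tool's regex engine. The sample strings are hypothetical and are not taken from the PDF. It shows that the `Name` pattern only accepts capitalized words separated by single spaces, so all-caps names, hyphenated names, a leading label, or trailing whitespace in the text segment would all fail to match.

```python
import re

# Patterns copied from the `conditions` section of the YAML above.
URC_PATTERN = re.compile(r"\b[a-f0-9]{32}\b")
NAME_PATTERN = re.compile(r"^[A-Z][a-z]+(?: [A-Z][a-z]+)* [A-Z][a-z]+$")


def matches_urc(text: str) -> bool:
    """True if the text contains a standalone 32-char lowercase hex string."""
    return URC_PATTERN.search(text) is not None


def matches_name(text: str) -> bool:
    """True if the whole text is two or more capitalized words."""
    return NAME_PATTERN.search(text) is not None


# Hypothetical text segments, for illustration only.
samples = [
    "John Smith",                        # plain capitalized name
    "JOHN SMITH",                        # all caps
    "Full Name: John Smith",             # label and value in one segment
    "John Smith ",                       # trailing whitespace
    "Mary-Jane O'Neil",                  # hyphen and apostrophe
    "0123456789abcdef0123456789abcdef",  # 32 lowercase hex characters
]

for segment in samples:
    print(f"{segment!r:40} Name={matches_name(segment)} URC={matches_urc(segment)}")
```

Only the first sample satisfies `Name`, and only the last satisfies `URC`. If the name on the form is printed in capitals, or arrives in the same text segment as its label, the `Name` condition as written would never be true.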
@wstumbo-mfj
Contributor

I'm trying to map the image you supplied onto your YAML file, but with some of the labels blurred I'm having a difficult time. Can you supply me with a blank copy of the form? With that, I should be able to get a better understanding of what you're trying to do and what is actually happening. I'll then try to give you some reasonable insight into how to proceed.

@wstumbo-mfj wstumbo-mfj self-assigned this Oct 23, 2024