Tutorial or Quickstart? #11

Open
krambox opened this issue Apr 30, 2020 · 9 comments
Labels
documentation Documentation enhancements/improvements

Comments

@krambox

krambox commented Apr 30, 2020

Does anyone know of a tutorial for getting started, or a few more (but simpler) examples of YAML files? I'm having a hard time right now. For example, I want to extract a number from a PDF form at exactly one position.

@stephenbmfj
Contributor

If you can share your PDF form, or a redacted/modified version of it, I can write a yaml and add it to the examples.

@alvinets

Could you help write a YAML example for the PDF below?
Sample_Certificate.pdf
I want to extract three fields:

  1. Name (e.g. Alvin Lam)
  2. Course Name (e.g. Get Ready: Club Public Image Committee)
  3. Completion Date (e.g. 5/25/20)

@eabase

eabase commented Apr 25, 2021

@stephenbyrne-mfj
I've added some stuff in issue #19. Maybe you can use that, if we can get it working.

UPDATE:

It has now been fixed, so feel free to use it and include those docs and images in your example folders.

@stephenbmfj
Contributor

For Sample_Certificate.pdf:

---
extractor: "pdf.itext5"

header:
  default: 200 # Ignore the "Learning and Development" portion
footer:
  default: 500

maxRowDistance: 2

rootRecordType: certificate
recordTypes:
  certificate:
    label: "Certificate"
    valueTypes:
      - name
      - course
      - date

valueTypes:
  name:
    label: "Name"
  course:
    label: "Course Name"
  date:
    label: "Completion Date"
    # strip out the "on " before the date
    replacements:
      -
        pattern: "on\ *(.*)"
        replacement: "$1"

initialState: "INIT"

states:
  INIT:
    transitions:
      -
        condition: certifiesThat
        nextState: certifiesThat

  certifiesThat:
    include: false
    transitions:
      -
        condition: any
        nextState: name

  name:
    transitions:
      -
        condition: hasSuccessfullyCompleted
        nextState: hasSuccessfullyCompleted
      -
        condition: any
        nextState: name

  hasSuccessfullyCompleted:
    include: false
    transitions:
      -
        condition: any
        nextState: course

  course:
    transitions:
      -
        condition: date
        nextState: date
      -
        condition: any
        nextState: course

  date:
    transitions:
      -
        condition: certifiesThat
        nextState: certifiesThat

conditions:

  any: '1 = 1'

  certifiesThat: 'text = "Certifies that"'

  hasSuccessfullyCompleted: 'text = "has successfully completed"'

  date: 'text =~ /on .*/ and fontSize = 14.0'

Generates:

page,Name,Course Name,Completion Date
1,Alvin Lam,Get Ready: Club Public Image Committee,5/25/20
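
One way to read the config above is as a small state machine: each piece of text is tested against the current state's transitions in order, the first matching condition picks the next state, and the text is added to that state's value unless the state has include: false. The Python sketch below mimics that reading for this certificate. It is not Textricator's implementation; the text segments, their font sizes (other than 14.0 for the date), the whole-string regex match, and the skipping of unmatched text are all assumptions made for illustration.

```python
import re

# Hypothetical text segments in reading order: (text, font size).
# Only the 14.0 for the date comes from the config; the rest are made up.
SEGMENTS = [
    ("Certifies that", 12.0),
    ("Alvin Lam", 24.0),
    ("has successfully completed", 12.0),
    ("Get Ready: Club Public Image Committee", 18.0),
    ("on 5/25/20", 14.0),
]

# The "conditions" section rewritten as Python predicates.
CONDITIONS = {
    "any": lambda text, size: True,
    "certifiesThat": lambda text, size: text == "Certifies that",
    "hasSuccessfullyCompleted": lambda text, size: text == "has successfully completed",
    # Assumption: the regex must match the whole text.
    "date": lambda text, size: re.fullmatch(r"on .*", text) is not None and size == 14.0,
}

# The "states" section: state -> (include text?, ordered (condition, nextState) pairs).
STATES = {
    "INIT": (True, [("certifiesThat", "certifiesThat")]),
    "certifiesThat": (False, [("any", "name")]),
    "name": (True, [("hasSuccessfullyCompleted", "hasSuccessfullyCompleted"),
                    ("any", "name")]),
    "hasSuccessfullyCompleted": (False, [("any", "course")]),
    "course": (True, [("date", "date"), ("any", "course")]),
    "date": (True, [("certifiesThat", "certifiesThat")]),
}


def run(segments):
    """Walk the segments through the state machine and collect one record."""
    state = "INIT"
    collected = {}
    for text, size in segments:
        # The first transition whose condition holds decides the next state.
        for condition, next_state in STATES[state][1]:
            if CONDITIONS[condition](text, size):
                state = next_state
                break
        else:
            continue  # assumption: text matching no transition is skipped
        include, _ = STATES[state]
        if include:
            collected.setdefault(state, []).append(text)
    # Consecutive texts in the same state are joined into one value (simplification).
    record = {name: " ".join(parts) for name, parts in collected.items()}
    # The "replacements" entry on the date value: strip the leading "on ".
    if "date" in record:
        record["date"] = re.sub(r"on *(.*)", r"\1", record["date"])
    return record
```

Calling run(SEGMENTS) returns a dict whose name, course and date values match the CSV row above. The certifiesThat and hasSuccessfullyCompleted states act only as markers: they move the machine forward but contribute nothing to the output.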

@stephenbmfj
Contributor

@krambox This is a good simple example. If you create a pull request to add the PDF to src/test/resources/io/mfj/textricator/examples/, we will include it.

I do not want to assume that we can distribute your PDF; submitting the PR yourself makes the permission more explicit.

@eabase

eabase commented May 17, 2021

Speaking of tutorials:

  1. One of the most bewildering things about using this tool is understanding the FSM code. (I still don't, TBH.)
    Can you provide some links on how to learn the YAML for the FSM? Is it general enough for its use here?

  2. The second most challenging part is making precise and correct measurements on the original PDF files.
    I strongly suggest that each example PDF file be supplemented with an image of the same PDF with the measurements drawn in, as I have done in issue #19 (Failing to make a simple example with exactly formatted PDF). (See the instructions below.)

Measuring up your PDF

This is very easy to do in Acrobat Reader DC if you open the additional tools. Select Measuring Tool and right-click somewhere on the page. There, first select Change Scale Ratio and Precision and enter 1 pt = 1 pt in the pop-up box. There are some bugs in Acrobat Reader that (a) make it forget these settings and (b) confuse it when measuring close to the edge of the document. It takes a few tries to get the hang of it.

[Screenshot: AcroRd32_2021-05-17_23-23-45]


Maybe you should add the above note and picture to your wiki, to the readme, or wherever it can be easily found.
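
If a measurement was taken in inches or millimetres instead (for example because the viewer forgot the 1 pt = 1 pt scale), it can be converted by hand. The sketch below assumes the positions in the YAML are in PDF points, as the scale setting above suggests, and relies only on the standard definition of 72 points per inch. The function names are made up for illustration.

```python
POINTS_PER_INCH = 72.0  # standard PDF/typographic point
MM_PER_INCH = 25.4


def inches_to_points(inches):
    """Convert a length measured in inches to points."""
    return inches * POINTS_PER_INCH


def mm_to_points(mm):
    """Convert a length measured in millimetres to points."""
    return mm / MM_PER_INCH * POINTS_PER_INCH
```

For example, inches_to_points(2.5) gives 180.0, so a mark measured 2.5 inches from the edge sits at 180 pt.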

@stephenbmfj
Contributor

I usually run textricator text input.pdf input-text.csv and then open input-text.csv in LibreOffice Calc (or whatever your favorite spreadsheet tool is). I just added a section to the readme about this: 97b7b33
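
For anyone who prefers a script to a spreadsheet, the same CSV can be inspected with a few lines of Python. This is only a sketch: it assumes the file has a header row and makes no assumption about which columns Textricator writes. The helper name preview_csv is made up for illustration.

```python
import csv
from itertools import islice


def preview_csv(path, limit=5):
    """Return the column names and the first `limit` rows of a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(islice(reader, limit))
        # fieldnames is None for an empty file, so fall back to an empty list.
        return list(reader.fieldnames or []), rows
```

Running preview_csv("input-text.csv") after the command above and printing the result shows which columns are available and what the first few text segments look like.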

@stephenbmfj
Contributor

Writing the FSM code is definitely the hardest part. I agree there should be a tutorial (or maybe a video?) that explains how it works and walks through developing one for a simple document.

@eabase

eabase commented May 17, 2021

@wstumbo-mfj wstumbo-mfj added the documentation Documentation enhancements/improvements label Mar 4, 2022
5 participants