Releases: jsvine/pdfplumber

v0.11.5

01 Jan 15:32

Added

  • Add --format text option to CLI (in addition to previously-available csv and json) (h/t @brandonrobertz). (#1235)
  • Add raise_unicode_errors: bool parameter to pdfplumber.open() to allow bypassing UnicodeDecodeErrors in annotation-parsing and generate warnings instead (h/t @stolarczyk). (#1195)
  • Add name property to image objects (h/t @djr2015). (#1201)
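A minimal sketch of how the two library-level additions above might be used together. The helper name `summarize_page_objects` is hypothetical, and the code assumes the image name is exposed as a `"name"` key on each image dict; `.get(...)` is used so the helper still works if the key is absent.

```python
def summarize_page_objects(pdf_path):
    """Return (annotation_count, image_names) for the first page of a PDF."""
    # Imported lazily so this helper can be defined without pdfplumber installed.
    import pdfplumber

    # raise_unicode_errors=False asks pdfplumber to warn, rather than raise,
    # when annotation parsing hits a UnicodeDecodeError.
    with pdfplumber.open(pdf_path, raise_unicode_errors=False) as pdf:
        page = pdf.pages[0]
        annotation_count = len(page.annots)
        # Assumption: each image dict carries the new "name" entry.
        image_names = [image.get("name") for image in page.images]
    return annotation_count, image_names
```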

Fixed

  • Fix PageImage.debug_tablefinder(...) so that its main keyword argument is named the same (table_settings=) as other related Page methods (h/t @stolarczyk). (#1237)
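With this fix, the same settings dict can be passed by keyword to both the page-level table finder and the debugging overlay. The sketch below is illustrative only: the helper name and the example strategy values are assumptions, and rendering a page image requires pdfplumber's imaging dependencies to be installed.

```python
def save_table_debug_image(pdf_path, out_path, table_settings=None):
    """Draw the table finder's output for page 1 and save it as an image."""
    # Imported lazily so this helper can be defined without pdfplumber installed.
    import pdfplumber

    # Example settings; adjust the strategies to suit the document at hand.
    settings = table_settings or {
        "vertical_strategy": "lines",
        "horizontal_strategy": "lines",
    }
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[0]
        tables = page.find_tables(table_settings=settings)
        image = page.to_image(resolution=100)
        # Same keyword name as Page.find_tables(...) as of v0.11.5.
        image.debug_tablefinder(table_settings=settings)
        image.save(out_path)
    return len(tables)
```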

v0.11.4

18 Aug 23:43

Fixed

  • Fix one type hint so that it doesn't throw error on Python 3.8 (h/t @andrekeller). (#1184)

v0.11.3

07 Aug 20:34
Compare
Choose a tag to compare

Added

Changed

  • Change default setting pdfplumber.repair(...) passes to Ghostscript's -dPDFSETTINGS parameter, from prepress to default, and make that setting modifiable via .repair(setting=...), where the value is one of "default", "prepress", "printer", or "ebook" (h/t @Laubeee). (#874 + 48cab3f)
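A hedged sketch of choosing the Ghostscript setting explicitly. The wrapper `repair_pdf` and its up-front validation are illustrative additions, not part of pdfplumber; the `outfile=` keyword is assumed from the library's documented usage, and Ghostscript must be installed for the call to succeed.

```python
# The four values listed in the release note above.
VALID_SETTINGS = ("default", "prepress", "printer", "ebook")


def repair_pdf(in_path, out_path, setting="default"):
    """Repair a PDF via Ghostscript, writing the result to out_path."""
    if setting not in VALID_SETTINGS:
        raise ValueError(f"setting must be one of {VALID_SETTINGS}, got {setting!r}")

    # Imported lazily so the validation above works without pdfplumber installed.
    import pdfplumber

    # Pass setting="prepress" to get the pre-0.11.3 default behavior.
    pdfplumber.repair(in_path, outfile=out_path, setting=setting)
```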

Fixed

  • Fix handling of object coordinates when mediabox does not begin at (0,0) (h/t @wodny). (#1181 + 9025c3f + 046bd87)
  • Fix error on getting .annots/.hyperlinks from CroppedPage (due to missing .rotation and .initial_doctop attributes) (h/t @Safrone). (#1171 + e5737d2)
  • Fix problem where Page.crop(...) was not cropping .annots/.hyperlinks (h/t @Safrone). (#1171 + 22494e8)
  • Fix calculation of coordinates for .annots on CroppedPages. (0bbb340 + b16acc3)
  • Dereference structure element attributes (h/t @dhdaines). (#1169 + 3f16180)
  • Fix Page.get_attr(...) so that it fully resolves references before determining whether the attribute's value is None (h/t @zzhangyun + @mkl-public). (#1176 + c20cd3b)

v0.11.2

06 Jul 21:56

Added

  • Add extra_attrs parameter to .dedupe_chars(...) to adjust the properties used when deduplicating (h/t @QuentinAndre11). (#1114)
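As a rough illustration, the new parameter might be used like this. The helper name is hypothetical, and the default tuple of `("fontname", "size")` is an assumption about a sensible starting point; passing an empty tuple would deduplicate on text and position alone.

```python
def extract_deduped_text(pdf_path, extra_attrs=("fontname", "size")):
    """Extract text from page 1 after removing duplicated characters."""
    # Imported lazily so this helper can be defined without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[0]
        # Characters count as duplicates only if they overlap within the
        # tolerance and also agree on every attribute named in extra_attrs.
        deduped = page.dedupe_chars(tolerance=1, extra_attrs=extra_attrs)
        return deduped.extract_text()
```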

Development Changes

  • Remove testing for Python 3.8, add testing for Python 3.12. (944eaed)
  • Upgrade flake8, pytest, and pytest-cov — and add setuptools and py as explicit dev requirements (for Python 3.12).

v0.11.1

11 Jun 20:36

Fixed

  • Fix .open(..., repair=True) subprocess args (to avoid stderr being captured). (70534a7)
  • Fix coordinates of annots on rotated pages. (aaa35c9)
  • Fix handling of PDFDocEncoding failures in decode_text(...). (#1147 + 4daf0aa)
  • Add .get_textmap.cache_clear() to page.close(). (0a26f05)

v0.11.0

07 Mar 12:57

Summary: More control over the {left-to-right, right-to-left, top-to-bottom, bottom-to-top} direction that pdfplumber reads/writes text (many thanks to @afriedman412 for the idea and prototype in #1040), plus upgrading to pdfminer.six's latest release (which provides more detailed paths for curves), and some fixes.

Added

  • Add {line,char}_dir{,rotated,render} params, to provide better support for non–top-to-bottom, left-to-right text (h/t @afriedman412). (850fd45)
  • Add curve["path"] and curve["dash"], thanks to pdfminer.six upgrade (see below). (1820247)
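A minimal sketch of the direction parameters in use, assuming a page whose lines run top-to-bottom and whose characters run right-to-left. The helper name is hypothetical; the accepted direction values are taken from the set listed in the summary above.

```python
def extract_rtl_words(pdf_path, line_dir="ttb", char_dir="rtl"):
    """Extract words from page 1, reading characters right-to-left."""
    # Imported lazily so this helper can be defined without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[0]
        # line_dir orders the lines; char_dir orders characters within a line.
        return page.extract_words(line_dir=line_dir, char_dir=char_dir)
```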

Changed

  • Upgrade pdfminer.six from 20221105 to 20231228. (cd2f768)
  • Change value of word["direction"] from {1,-1} to {"ltr","rtl","ttb","btt"}. (850fd45)
  • Deprecate vertical_ttb, horizontal_ltr in favor of char_dir and char_dir_rotated. (850fd45)
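Code written against earlier releases may still compare word["direction"] to 1 or -1. The sketch below is one way to tally words by the new string values while failing loudly on old-style integers; the helper is illustrative and operates on plain dicts shaped like extract_words(...) output.

```python
from collections import Counter

# The string values used from v0.11.0 onward.
NEW_DIRECTIONS = {"ltr", "rtl", "ttb", "btt"}


def count_word_directions(words):
    """Count words per direction, rejecting pre-0.11.0 integer values."""
    counts = Counter()
    for word in words:
        direction = word["direction"]
        if direction not in NEW_DIRECTIONS:
            raise ValueError(f"unexpected direction value: {direction!r}")
        counts[direction] += 1
    return counts
```

For example, `count_word_directions(page.extract_words())` gives a quick view of whether a page mixes horizontal and vertical text.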

Fixed

  • Fix layout-caching issue caused by 0bfffc2. (#1097 + efca277)
  • Fix missing ParentTree edge-case. (#1094)

v0.10.4

10 Feb 23:38

Added

  • Add x_tolerance_ratio parameter to extract_text and similar functions, to account for text size when spacing characters (instead of a fixed number of pixels) (h/t @afriedman412). (#1041)
  • Add support for PDF 1.3 logical structure via Page.structure_tree (h/t @dhdaines). (#963)
  • Add "gswin64c" as another possible Ghostscript executable in repair.py (h/t @echedey-ls). (#1032)
  • Re-add Page.close() method, have PDF.close() close all pages as well, and improve relevant documentation (h/t @luketudge). (#1042)
  • Add force_mediabox parameter to Page.to_image(...). (#1054)
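To illustrate the idea behind x_tolerance_ratio: instead of one fixed gap threshold, the threshold grows with the text size. The first function below is a conceptual sketch only and is an assumption about the behavior, not pdfplumber's actual implementation; the second shows the parameter being passed to extract_text. Both helper names are hypothetical.

```python
def effective_x_tolerance(char_size, x_tolerance=3, x_tolerance_ratio=None):
    """Conceptual model: scale the horizontal tolerance with the font size."""
    if x_tolerance_ratio is None:
        # No ratio given: fall back to the fixed tolerance.
        return x_tolerance
    return x_tolerance_ratio * char_size


def extract_text_scaled(pdf_path, x_tolerance_ratio=0.15):
    """Extract page 1 text using a size-relative horizontal tolerance."""
    # Imported lazily so the helper above works without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return pdf.pages[0].extract_text(x_tolerance_ratio=x_tolerance_ratio)
```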

Fixed

  • Standardize handling of cropbox, fixing various issues with PageImage. (#1054)
  • Fix Page.get_textmap caching to allow for extra_attrs=[...], by preconverting list kwargs to tuples. (#1030)
  • Explicitly close pypdfium2.PdfDocument in get_page_image (h/t @dhdaines). (#1090)
  • In PDFPageAggregatorWithMarkedContent.tag_cur_item, check self.cur_item._objs length before trying to access [-1]. (4f39d03)

v0.10.3

26 Oct 14:08

Added

  • Add support for marked-content sequences, represented by mcid and tag attributes on char/rect/line/curve/image objects (h/t @dhdaines). (#961)
  • Add gs_path argument to pdfplumber.open(...) and pdfplumber.repair(...), to allow passing a custom Ghostscript path to be used for repairing. (#953)
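One possible use of the new mcid and tag attributes is to reassemble text per marked-content sequence. The helper below is an illustrative sketch that works on plain char dicts such as those in page.chars; it assumes characters outside any marked-content sequence have no mcid (or an mcid of None) and skips them.

```python
from collections import defaultdict


def text_by_marked_content(chars):
    """Group character text by (mcid, tag), preserving input order."""
    groups = defaultdict(list)
    for char in chars:
        mcid = char.get("mcid")
        if mcid is None:
            continue  # not part of a marked-content sequence
        groups[(mcid, char.get("tag"))].append(char["text"])
    return {key: "".join(parts) for key, parts in groups.items()}
```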

Fixed

v0.10.2

29 Jul 19:04

Added

  • Add PDF.path: A Path object for PDFs loaded by passing a path (unless repair=True), and None otherwise. (30a52cb + #948)

  • Accept Iterable objects for geometry utils (h/t @dhdaines). (53bee23 + #945)

Changed

Fixed

v0.10.1

19 Jul 19:03

A simple release:

Added

  • Add antialias boolean parameter to Page.to_image(...) and associated methods (h/t @cmdlineluser). (7e28931)
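A short sketch of the new parameter, assuming pdfplumber's imaging dependencies are installed. The helper name and the resolution value are illustrative choices.

```python
def render_page_antialiased(pdf_path, out_path, resolution=150):
    """Render page 1 with antialiasing enabled and save it to out_path."""
    # Imported lazily so this helper can be defined without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        image = pdf.pages[0].to_image(resolution=resolution, antialias=True)
        image.save(out_path)
```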