Releases: jsvine/pdfplumber
v0.11.5
Added
- Add `--format text` options to CLI (in addition to previously-available `csv` and `json`) (h/t @brandonrobertz). (#1235)
- Add `raise_unicode_errors: bool` parameter to `pdfplumber.open()` to allow bypassing `UnicodeDecodeError`s in annotation-parsing and generate warnings instead (h/t @stolarczyk). (#1195)
- Add `name` property to `image` objects (h/t @djr2015). (#1201)
Fixed
- Fix `PageImage.debug_tablefinder(...)` so that its main keyword argument is named the same (`table_settings=`) as other related `Page` methods (h/t @stolarczyk). (#1237)
v0.11.4
v0.11.3
Added
- Add `Table.columns`, analogous to `Table.rows` (h/t @Pk13055). (#1050 + d39302f)
- Add `Page.extract_words(return_chars=True)`, mirroring `Page.search(..., return_chars=True)`; if this argument is passed, each word dictionary will include an additional key-value pair: `"chars": [char_object, ...]` (h/t @cmdlineluser). (#1173 + 1496cbd)
- Add `pdfplumber.open(unicode_norm="NFC"/"NFD"/"NFKC"/"NFKD")`, where the values are the four options for Unicode normalization (h/t @petermr + @agusluques). (#905 + 03a477f)
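To make the four normalization forms concrete, the sketch below uses only Python's standard `unicodedata` module; it does not call pdfplumber. The helper name `normalize_all` and the sample strings are illustrative assumptions, not part of the library.

```python
import unicodedata

# The four Unicode normalization forms named in the release note.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalize_all(text):
    """Return a dict mapping each normalization form to the normalized text."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


# "e" followed by a combining acute accent (two code points).
decomposed_e = "e\u0301"
# The single-code-point "fi" ligature, common in typeset PDFs.
ligature_fi = "\ufb01"

accent_forms = normalize_all(decomposed_e)
ligature_forms = normalize_all(ligature_fi)

# NFC composes the accent into one code point; NFD keeps it decomposed.
print(len(accent_forms["NFC"]), len(accent_forms["NFD"]))
# Only the compatibility forms (NFKC/NFKD) expand the ligature to "fi".
print(ligature_forms["NFC"] == ligature_fi, ligature_forms["NFKC"])
```

Ligatures are one reason the compatibility forms can matter for text extracted from PDFs; per the note above, the form is selected with `pdfplumber.open(..., unicode_norm="NFKC")`.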
Changed
- Change default setting `pdfplumber.repair(...)` passes to Ghostscript's `-dPDFSETTINGS` parameter, from `prepress` to `default`, and make that setting modifiable via `.repair(setting=...)`, where the value is one of `"default"`, `"prepress"`, `"printer"`, or `"ebook"` (h/t @Laubeee). (#874 + 48cab3f)
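As a rough illustration of how a `setting` value could map onto Ghostscript's `-dPDFSETTINGS` flag, the sketch below assembles a command line without running it. The function `build_gs_command` is hypothetical, and the exact arguments pdfplumber passes to Ghostscript may differ; only `-o`, `-sDEVICE=pdfwrite`, and `-dPDFSETTINGS=/...` are standard Ghostscript options.

```python
# The four values listed in the release note.
VALID_SETTINGS = ("default", "prepress", "printer", "ebook")


def build_gs_command(in_path, out_path, setting="default", gs_path="gs"):
    """Build (but do not run) a Ghostscript command that rewrites a PDF.

    Hypothetical helper for illustration; not pdfplumber's implementation.
    """
    if setting not in VALID_SETTINGS:
        raise ValueError(f"setting must be one of {VALID_SETTINGS}, got {setting!r}")
    return [
        gs_path,
        "-o", out_path,                 # output file (implies batch mode, no pause)
        "-sDEVICE=pdfwrite",            # write a new PDF
        f"-dPDFSETTINGS=/{setting}",    # quality/size preset
        in_path,
    ]


print(build_gs_command("broken.pdf", "repaired.pdf", setting="prepress"))
```

The resulting list could be handed to `subprocess.run(...)` on a machine where Ghostscript is installed.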
Fixed
- Fix handling of object coordinates when `mediabox` does not begin at `(0,0)` (h/t @wodny). (#1181 + 9025c3f + 046bd87)
- Fix error on getting `.annots`/`.hyperlinks` from `CroppedPage` (due to missing `.rotation` and `.initial_doctop` attributes) (h/t @Safrone). (#1171 + e5737d2)
- Fix problem where `Page.crop(...)` was not cropping `.annots`/`.hyperlinks` (h/t @Safrone). (#1171 + 22494e8)
- Fix calculation of coordinates for `.annots` on `CroppedPage`s. (0bbb340 + b16acc3)
- Dereference structure element attributes (h/t @dhdaines). (#1169 + 3f16180)
- Fix `Page.get_attr(...)` so that it fully resolves references before determining whether the attribute's value is `None` (h/t @zzhangyun + @mkl-public). (#1176 + c20cd3b)
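The mediabox fix concerns pages whose coordinate origin is offset. The sketch below shows the general idea of translating a PDF-space bounding box (origin at bottom-left) into top-left-based `x0`/`x1`/`top`/`bottom` values relative to the mediabox. The function name `to_page_coords` is an assumption for illustration; this is not pdfplumber's actual code.

```python
def to_page_coords(obj_bbox, mediabox):
    """Translate a PDF-space bbox into coordinates relative to the mediabox.

    obj_bbox and mediabox are (x0, y0, x1, y1) tuples in PDF space,
    where y grows upward. The result measures `top`/`bottom` downward
    from the mediabox's upper edge.
    """
    x0, y0, x1, y1 = obj_bbox
    mb_x0, _mb_y0, _mb_x1, mb_y1 = mediabox
    return {
        "x0": x0 - mb_x0,        # shift by the mediabox's left edge
        "x1": x1 - mb_x0,
        "top": mb_y1 - y1,       # distance from the mediabox's top edge
        "bottom": mb_y1 - y0,
    }


# A mediabox that starts at (100, 200) rather than (0, 0).
mediabox = (100, 200, 712, 992)
obj = (110, 900, 160, 950)
print(to_page_coords(obj, mediabox))
```

If the mediabox offset were ignored, the same object would be reported 100 units too far right and with a vertical position computed from the wrong edge.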
v0.11.2
Added
- Add `extra_attrs` parameter to `.dedupe_chars(...)` to adjust the properties used when deduplicating (h/t @QuentinAndre11). (#1114)
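The effect of `extra_attrs` can be pictured with a simplified sketch: two characters count as duplicates only if their text, position, and every listed extra attribute match. The function `dedupe`, its exact-position matching (the library uses a tolerance), and the default of `("fontname", "size")` are assumptions made for illustration.

```python
def dedupe(chars, extra_attrs=("fontname", "size")):
    """Drop chars that repeat an earlier char's text, position, and extra attrs.

    Simplified: positions must match exactly, whereas a real implementation
    would allow a small tolerance.
    """
    seen = set()
    kept = []
    for char in chars:
        key = (char["text"], char["x0"], char["top"]) + tuple(
            char.get(attr) for attr in extra_attrs
        )
        if key not in seen:
            seen.add(key)
            kept.append(char)
    return kept


chars = [
    {"text": "A", "x0": 10, "top": 20, "fontname": "Helvetica", "size": 12},
    {"text": "A", "x0": 10, "top": 20, "fontname": "Helvetica-Bold", "size": 12},
    {"text": "A", "x0": 10, "top": 20, "fontname": "Helvetica", "size": 12},
]

print(len(dedupe(chars)))                  # font name distinguishes two of them
print(len(dedupe(chars, extra_attrs=())))  # position and text only
```

Passing an empty `extra_attrs` makes the comparison purely positional, which is the kind of adjustment the new parameter is described as enabling.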
Development Changes
- Remove testing for Python 3.8, add testing for Python 3.12. (944eaed)
- Upgrade `flake8`, `pytest`, and `pytest-cov`, and add `setuptools` and `py` as explicit dev requirements (for Python 3.12).
v0.11.1
v0.11.0
Summary: More control over the {left-to-right, right-to-left, top-to-bottom, bottom-to-top} direction that `pdfplumber` reads/writes text (many thanks to @afriedman412 for the idea and prototype in #1040), plus upgrading to `pdfminer.six`'s latest release (which provides more detailed paths for curves), and some fixes.
Added
- Add `{line,char}_dir{,_rotated,_render}` params, to provide better support for non–top-to-bottom, left-to-right text (h/t @afriedman412). (850fd45)
- Add `curve["path"]` and `curve["dash"]`, thanks to `pdfminer.six` upgrade (see below). (1820247)
Changed
- Upgrade `pdfminer.six` from `20221105` to `20231228`. (cd2f768)
- Change value of `word["direction"]` from `{1,-1}` to `{"ltr","rtl","ttb","btt"}`. (850fd45)
- Deprecate `vertical_ttb`, `horizontal_ltr` in favor of `char_dir` and `char_dir_rotated`. (850fd45)
Fixed
v0.10.4
Added
- Add `x_tolerance_ratio` parameter to `extract_text` and similar functions, to account for text size when spacing characters (instead of a fixed number of pixels) (h/t @afriedman412). (#1041)
- Add support for PDF 1.3 logical structure via `Page.structure_tree` (h/t @dhdaines). (#963)
- Add "gswin64c" as another possible Ghostscript executable in `repair.py` (h/t @echedey-ls). (#1032)
- Re-add `Page.close()` method, have `PDF.close()` close all pages as well, and improve relevant documentation (h/t @luketudge). (#1042)
- Add `force_mediabox` parameter to `Page.to_image(...)`. (#1054)
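The motivation for `x_tolerance_ratio` can be sketched as follows: a fixed tolerance treats the same gap identically for small and large type, while a ratio scales the threshold with the font size. The function `needs_space` and the choice to scale by the previous character's `size` are assumptions for illustration, not a copy of the library's logic.

```python
def needs_space(prev_char, next_char, x_tolerance=3, x_tolerance_ratio=None):
    """Decide whether the horizontal gap between two chars implies a space.

    If x_tolerance_ratio is given, the threshold is that ratio times the
    previous char's font size; otherwise the fixed x_tolerance is used.
    """
    gap = next_char["x0"] - prev_char["x1"]
    if x_tolerance_ratio is None:
        threshold = x_tolerance
    else:
        threshold = x_tolerance_ratio * prev_char["size"]
    return gap > threshold


# Small type: a gap of 2 units is a real word break at size 6.
small_a = {"x1": 10, "size": 6}
small_b = {"x0": 12, "size": 6}

# Large type: a gap of 4 units is ordinary letter spacing at size 40.
large_a = {"x1": 100, "size": 40}
large_b = {"x0": 104, "size": 40}

print(needs_space(small_a, small_b), needs_space(small_a, small_b, x_tolerance_ratio=0.15))
print(needs_space(large_a, large_b), needs_space(large_a, large_b, x_tolerance_ratio=0.15))
```

With the fixed tolerance, the small-type word break is missed and the large-type letter spacing is mistaken for a space; the ratio-based threshold handles both cases in this example.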
Fixed
- Standardize handling of cropbox, fixing various issues with PageImage. (#1054)
- Fix `Page.get_textmap` caching to allow for `extra_attrs=[...]`, by preconverting list kwargs to tuples. (#1030)
- Explicitly close `pypdfium2.PdfDocument` in `get_page_image` (h/t @dhdaines). (#1090)
- In `PDFPageAggregatorWithMarkedContent.tag_cur_item`, check `self.cur_item._objs` length before trying to access `[-1]`. (4f39d03)
v0.10.3
Added
- Add support for marked-content sequences, represented by `mcid` and `tag` attributes on `char`/`rect`/`line`/`curve`/`image` objects (h/t @dhdaines). (#961)
- Add `gs_path` argument to `pdfplumber.open(...)` and `pdfplumber.repair(...)`, to allow passing a custom Ghostscript path to be used for repairing. (#953)