[camelot] read_pdf() parameters

콜레오네 2021. 2. 8. 20:18

What is Camelot?

Camelot is a Python library that extracts tables from PDF files.

The main way to use camelot is through the read_pdf() function.

With suitable read_pdf() parameters,

camelot can extract and render the tables in a PDF properly.

This post is based on the official camelot documentation.

camelot-py.readthedocs.io/en/master/index.html

Basic usage

camelot.read_pdf("./dir/file.pdf", pages="all")

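The first argument is the path to the PDF file.

The second argument, pages, sets the page range to extract tables from, for example "1", "1,3,4", "1,4-end" or "all" (the default is "1").

This is the most basic way to use it.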
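
As a minimal sketch of the basic call, the snippet below wraps read_pdf() in a small function. The helper names page_spec and extract_tables are my own and are not part of camelot; the sketch only relies on read_pdf(), the n attribute of the returned table list, and the df attribute of each table.

```python
def page_spec(pages):
    """Join page numbers into camelot's comma-separated form (hypothetical helper)."""
    return ",".join(str(p) for p in pages)


def extract_tables(pdf_path, pages="1"):
    """Return every detected table as a pandas DataFrame (hypothetical helper)."""
    # Imported lazily so page_spec() stays usable without camelot installed.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages=pages)
    # tables.n is the number of detected tables; .df holds each table's data.
    return [tables[i].df for i in range(tables.n)]


spec = page_spec([1, 3, 4])
```

For example, `extract_tables("./dir/file.pdf", pages=spec)` would read pages 1, 3 and 4.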

A [stream] or [lattice] tag to the right of a parameter name means the parameter is only available with that flavor.

Parameters with no tag work with both.

password

Password of the document, if it is protected
default : None
example) "my-password"

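flavor

Parsing method: "lattice" for tables with ruling lines between cells, "stream" for tables whose cells are separated by whitespace
default : "lattice"
example) "lattice" or "stream" (the only accepted values)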

layout_kwargs

A dict of keyword arguments passed to pdfminer.layout.LAParams
default : {}
example) {"detect_vertical": False}

suppress_stdout

Whether to suppress logs and warnings
default : False
example) True

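table_areas

Exact boundaries of the table(s) to extract, given as ["x1,y1,x2,y2"], where (x1, y1) is the top-left and (x2, y2) the bottom-right corner in PDF coordinate space
default : None
example) ["312,489,566,325"]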

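table_regions

Regions of the page in which to search for tables, given as ["x1,y1,x2,y2"], where (x1, y1) is the top-left and (x2, y2) the bottom-right corner in PDF coordinate space
default : None
example) ["170,370,560,270"]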
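
The coordinate order is easy to get wrong, so here is a small sketch that builds the string for table_areas or table_regions. It assumes the order documented by camelot, "x1,y1,x2,y2" with (x1, y1) as the top-left and (x2, y2) as the bottom-right corner. The helper area_string is hypothetical and not part of camelot.

```python
def area_string(x1, y1, x2, y2):
    """Format a box as "x1,y1,x2,y2": top-left corner, then bottom-right corner."""
    # PDF coordinates start at the bottom-left of the page,
    # so the top edge has the larger y value.
    if not (x1 < x2 and y1 > y2):
        raise ValueError("expected x1 < x2 and y1 > y2 (top-left, bottom-right)")
    return f"{x1},{y1},{x2},{y2}"


areas = [area_string(312, 489, 566, 325)]
regions = [area_string(170, 370, 560, 270)]
```

The lists can then be passed as `camelot.read_pdf(path, table_areas=areas)` or `camelot.read_pdf(path, table_regions=regions)`.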

split_text

Split text that spans multiple cells so that each cell gets its own part
default : False
example) True

flag_size

Tag text based on font size so that superscripts and subscripts can be told apart (flagged text is wrapped in <s></s>)
default : False
example) True

strip_text

Characters to strip from the text before it is assigned to a cell
default : ""
example) "\n"
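
To illustrate what strip_text does to cell text, here is a plain-Python sketch of the documented behavior. It is not camelot's own code, and strip_chars is a hypothetical name: every character listed in the string is removed wherever it occurs in the cell text.

```python
def strip_chars(text, chars=""):
    """Remove every occurrence of the characters in `chars` from `text`."""
    return "".join(ch for ch in text if ch not in chars)


cleaned = strip_chars("1 234.5\n", " .\n")
```

Here `cleaned` is "12345", roughly what a cell would hold after `camelot.read_pdf(path, strip_text=" .\n")`.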

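columns [stream]

When the vertical lines separating the columns are unclear, specify the x coordinates of the column separators (one comma-separated string per table)
default : None
example) ["121,168,268,432,552"]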

row_tol [stream]

Tolerance for combining text vertically into rows; a larger value groups nearby lines of text into the same row
default : 2
example) 4

column_tol [stream]

Tolerance for combining text horizontally into columns; a larger value groups nearby text into the same column
default : 0
example) 2

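edge_tol [stream]

Tolerance for extending text edges vertically; raise it when the gaps between lines of text are large and the table gets cut short
default : 50
example) 150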
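
The stream-only parameters are usually tuned together, so the sketch below collects them into one keyword dict. stream_kwargs and read_stream are hypothetical helpers of mine; the keyword names and defaults are the ones listed above.

```python
def stream_kwargs(column_xs=None, row_tol=2, column_tol=0, edge_tol=50):
    """Build read_pdf() keyword arguments for the stream flavor."""
    kwargs = {
        "flavor": "stream",
        "row_tol": row_tol,
        "column_tol": column_tol,
        "edge_tol": edge_tol,
    }
    if column_xs:
        # camelot expects one comma-separated string of x coordinates per table.
        kwargs["columns"] = [",".join(str(x) for x in sorted(column_xs))]
    return kwargs


def read_stream(pdf_path, pages="1", **tuning):
    # Imported lazily so stream_kwargs() works without camelot installed.
    import camelot

    return camelot.read_pdf(pdf_path, pages=pages, **stream_kwargs(**tuning))


kwargs = stream_kwargs(column_xs=[121, 168, 268, 432, 552], row_tol=4)
```

A call such as `read_stream("./dir/file.pdf", row_tol=4, edge_tol=150)` would then apply the tuned values.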

process_background [lattice]

Detect lines drawn in the background, for example when cells have a background color
default : False
example) True

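line_scale [lattice]

Line size scaling factor; a larger value lets shorter lines be detected, but a very large value causes text to be detected as lines
default : 15
example) 40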
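
As far as I understand camelot's implementation, line_scale divides the size of the page image to get the length of the kernel used to detect lines, so a larger value means shorter lines still count. The sketch below only reproduces that arithmetic. Both the name kernel_length and the exact formula are assumptions of mine, not a documented API.

```python
def kernel_length(image_size_px, line_scale=15):
    """Approximate minimum line length (in pixels) that would be detected."""
    if line_scale <= 0:
        raise ValueError("line_scale must be positive")
    return image_size_px // line_scale


# An 11-inch-tall page rendered at 300 dpi is 3300 px high.
default_length = kernel_length(3300, 15)  # 220
small_length = kernel_length(3300, 40)    # 82
```

With the default of 15, vertical lines shorter than about 220 px would be ignored on such a page; with 40, lines down to about 82 px would be picked up.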

copy_text [lattice]

Copy the text of a spanning (merged) cell into the cells it covers, horizontally ("h") or vertically ("v")
default : None
example) ["h"] or ["v"]

shift_text [lattice]

Direction in which the text of a spanning (merged) cell is placed; allowed values are "", "l", "r", "t", "b"
default : ["l", "t"]
example) ["r", "b"]
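
Because copy_text and shift_text only accept a few single-letter values, a small validating helper can catch typos before read_pdf() is called. This is a sketch under the value sets listed above; spanning_cell_kwargs is a hypothetical name and not part of camelot.

```python
VALID_COPY = {"h", "v"}
VALID_SHIFT = {"", "l", "r", "t", "b"}


def spanning_cell_kwargs(copy_text=None, shift_text=("l", "t")):
    """Build lattice keyword arguments that control spanning (merged) cells."""
    copy_text = list(copy_text or [])
    shift_text = list(shift_text)
    if not set(copy_text) <= VALID_COPY:
        raise ValueError("copy_text may only contain 'h' and 'v'")
    if not set(shift_text) <= VALID_SHIFT:
        raise ValueError("shift_text may only contain '', 'l', 'r', 't', 'b'")
    kwargs = {"flavor": "lattice", "shift_text": shift_text}
    if copy_text:
        kwargs["copy_text"] = copy_text
    return kwargs


kwargs = spanning_cell_kwargs(copy_text=["v"], shift_text=["r", "b"])
```

The result can be unpacked into the call, as in `camelot.read_pdf(path, **kwargs)`.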

line_tol [lattice]

Tolerance for merging vertical and horizontal lines that lie close to each other
default : 2
example) 15

joint_tol [lattice]

Tolerance for deciding whether detected lines and joints (line intersections) lie close to each other
default : 2
example) 20

threshold_blocksize [lattice]

Size of the pixel neighborhood used to compute the threshold for each pixel (blockSize parameter of cv2.adaptiveThreshold())
Must be an odd number (3, 5, 7, ...)
default : 15
example) 7

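threshold_constant [lattice]

Constant subtracted from the (weighted) mean of the neighborhood to get the threshold; it may be positive, zero or negative (C parameter of cv2.adaptiveThreshold())
default : -2
example) -1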
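
To show how the two threshold parameters interact, here is a simplified pure-Python sketch of mean adaptive thresholding: a pixel becomes white when it is brighter than the mean of its neighborhood minus the constant. camelot itself calls OpenCV and, as far as I know, uses Gaussian weighting on an inverted image, so this is an illustration and not its implementation. The function name is mine.

```python
def adaptive_mean_threshold(image, blocksize=15, constant=-2, max_value=255):
    """Binarize a 2D list of gray values with a per-pixel mean threshold."""
    if blocksize < 3 or blocksize % 2 == 0:
        raise ValueError("blocksize must be an odd integer >= 3")
    height, width = len(image), len(image[0])
    half = blocksize // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            total = 0
            for dy in range(-half, half + 1):
                for dx in range(-half, half + 1):
                    # Clamp indices so border pixels reuse the nearest edge value.
                    yy = min(max(y + dy, 0), height - 1)
                    xx = min(max(x + dx, 0), width - 1)
                    total += image[yy][xx]
            # Threshold = neighborhood mean minus the constant.
            threshold = total / (blocksize * blocksize) - constant
            row.append(max_value if image[y][x] > threshold else 0)
        result.append(row)
    return result


# A dark 5x5 image with one bright horizontal line in the middle.
page = [[0] * 5, [0] * 5, [255] * 5, [0] * 5, [0] * 5]
binary = adaptive_mean_threshold(page, blocksize=3, constant=-2)
```

With a negative constant the threshold sits slightly above the local mean, so flat regions turn black and only pixels clearly brighter than their surroundings, such as the line above, stay white.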

iterations [lattice]

Number of times erosion/dilation is applied (iterations parameter of cv2.dilate())
default : 0
example) 2

resolution [lattice]

Resolution used when converting the PDF to PNG
default : 300
example) 500

That covers the parameters of camelot's read_pdf() function.

Thank you.
