[ Tesseract OCR ] Tesseract OCR - How to train it to fix misrecognized characters

[ Tesseract OCR ] Tesseract OCR - How to train it to fix misrecognized characters - Part 2

12. Download the base model used for training

- The traineddata bundled with the installed Tesseract cannot be trained further, so download a model from tessdata_best

- https://github.com/tesseract-ocr/tessdata_best [open]

- Code [click]

- Download ZIP [click]

13. Extract the archive

- tessdata_best-main.zip [extract]
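The full archive holds every language, while the later steps only use eng.traineddata. As a minimal sketch, the extraction can be limited to that one file with Python's standard zipfile module. The function name `extract_traineddata` is hypothetical, and the archive layout (a top-level tessdata_best-main folder) is an assumption based on the ZIP name used in this post.

```python
import zipfile
from pathlib import Path


def extract_traineddata(zip_path, dest_dir, lang="eng"):
    """Extract only <lang>.traineddata from the downloaded archive.

    Assumption: the file sits at tessdata_best-main/<lang>.traineddata
    inside the ZIP, matching the folder name used in this post.
    """
    member = f"tessdata_best-main/{lang}.traineddata"
    with zipfile.ZipFile(zip_path) as archive:
        if member not in archive.namelist():
            raise FileNotFoundError(f"{member} not found in {zip_path}")
        # extract() recreates the member's folder structure under dest_dir
        extracted = archive.extract(member, path=dest_dir)
    return Path(extracted)
```

Calling `extract_traineddata(r"C:\Transform\tessdata_best-main.zip", r"C:\Transform")` would, under those assumptions, leave the file at C:\Transform\tessdata_best-main\eng.traineddata, which is the path the next step expects.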

14. Create the LSTM file

- CMD [run]

> combine_tessdata -e C:\Transform\tessdata_best-main\eng.traineddata C:\Transform\eng.test.lstm

> [command] [option] [traineddata path] [output path]
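If this step is scripted, the same command can be assembled and run from Python. This is only a sketch: the function names are hypothetical, and it assumes combine_tessdata is on the PATH. combine_tessdata decides which component to extract from the extension of the output file, so the sketch rejects output names that do not end in .lstm.

```python
import subprocess
from pathlib import PureWindowsPath


def lstm_extract_command(traineddata, output_lstm):
    """Build the argument list for `combine_tessdata -e`."""
    # The component to extract is chosen from the output file's
    # extension, so the output name has to end in .lstm.
    if PureWindowsPath(output_lstm).suffix != ".lstm":
        raise ValueError("output file name must end in .lstm")
    return ["combine_tessdata", "-e", str(traineddata), str(output_lstm)]


def extract_lstm(traineddata, output_lstm):
    """Run the extraction; raises CalledProcessError on a non-zero exit."""
    subprocess.run(lstm_extract_command(traineddata, output_lstm), check=True)
```

For the paths in this post the call would be `extract_lstm(r"C:\Transform\tessdata_best-main\eng.traineddata", r"C:\Transform\eng.test.lstm")`.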

15. Create the training list

- Notepad [new file]

- lstmf file paths [write, one per line]

* Multiple images can be converted to LSTMF files and trained on together

@ Cautions when writing TrainList.txt

- Training fails when line endings are [CR][LF]; they must be changed to [LF]

- Notepad++ [run]

- Windows (CR LF) at the bottom [right-click]

- Unix (LF) [click]

- Confirm Unix (LF) & UTF-8 [save]
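The line-ending problem can also be avoided by generating TrainList.txt from a script instead of an editor. The sketch below is one way to do that in Python; the function names and the sample .lstmf paths in the usage note are hypothetical. Passing `newline="\n"` to `open()` stops Python on Windows from translating each `\n` into `\r\n`.

```python
from pathlib import Path


def write_train_list(lstmf_paths, list_path):
    """Write one .lstmf path per line, UTF-8 encoded, with LF line endings."""
    # newline="\n" disables the automatic "\n" -> "\r\n" translation on Windows.
    with open(list_path, "w", encoding="utf-8", newline="\n") as handle:
        for path in lstmf_paths:
            handle.write(f"{path}\n")


def convert_to_lf(list_path):
    """Rewrite an existing list file in place, replacing CRLF with LF."""
    target = Path(list_path)
    data = target.read_bytes()
    target.write_bytes(data.replace(b"\r\n", b"\n"))
```

For example, `write_train_list([r"C:\Transform\sample1.lstmf", r"C:\Transform\sample2.lstmf"], r"C:\Transform\TrainList.txt")` writes a two-line list, and `convert_to_lf(r"C:\Transform\TrainList.txt")` repairs a file that was already saved with Windows line endings.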

16. Create the folder where training output is saved

- Result folder [create]

17. Run Tesseract training

> lstmtraining --max_iterations 9000 --continue_from=C:\Transform\eng.test.lstm --model_output=C:\Transform\Result\ --train_listfile=C:\Transform\TrainList.txt --traineddata=C:\Transform\tessdata_best-main\eng.traineddata

> [command] [options] [LSTM path] [output path] [TrainList path] [traineddata path]

OPTIONS

'--debug_interval ': How often to display the alignment. (type:int default:0)
'--net_mode ': Controls network behavior. (type:int default:192)
'--perfect_sample_delay ': How many imperfect samples between perfect ones. (type:int default:0)
'--max_image_MB ': Max memory to use for images. (type:int default:6000)
'--append_index ': Index in continue_from Network at which to attach the new network defined by net_spec (type:int default:-1)
'--max_iterations ': If set, exit after this many iterations. A negative value is interpreted as epochs, 0 means infinite iterations. (type:int default:0)
'--target_error_rate ': Final error rate in percent. (type:double default:0.01)
'--weight_range ': Range of initial random weights. (type:double default:0.1)
'--learning_rate ': Weight factor for new deltas. (type:double default:0.001)
'--momentum ': Decay factor for repeating deltas. (type:double default:0.5)
'--adam_beta ': Decay factor for repeating deltas. (type:double default:0.999)
'--stop_training ': Just convert the training model to a runtime model. (type:bool default:false)
'--convert_to_int ': Convert the recognition model to an integer model. (type:bool default:false)
'--sequential_training ': Use the training files sequentially instead of round-robin. (type:bool default:false)
'--debug_network ': Get info on distribution of weight values (type:bool default:false)
'--randomly_rotate ': Train OSD and randomly turn training samples upside-down (type:bool default:false)
'--net_spec ': Network specification (type:string default:)
'--continue_from ': Existing model to extend (type:string default:)
'--model_output ': Basename for output models (type:string default:lstmtrain)
'--train_listfile ': File listing training files in lstmf training format. (type:string default:)
'--eval_listfile ': File listing eval files in lstmf training format. (type:string default:)
'--traineddata ': Starter traineddata with combined Dawgs/Unicharset/Recoder for language model (type:string default:)
'--old_traineddata ': When changing the character set, this specifies the traineddata with the old character set that is to be replaced (type:string default:)
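To repeat the training run with different settings, the command from step 17 can be built from Python. This is a rough sketch, not part of Tesseract: the function names are hypothetical, it assumes lstmtraining is on the PATH, and it only uses flags that appear in the option list above.

```python
import subprocess


def training_command(continue_from, model_output, train_listfile,
                     traineddata, max_iterations=9000):
    """Build the lstmtraining argument list used in step 17."""
    return [
        "lstmtraining",
        "--max_iterations", str(max_iterations),
        f"--continue_from={continue_from}",
        f"--model_output={model_output}",
        f"--train_listfile={train_listfile}",
        f"--traineddata={traineddata}",
    ]


def run_training(*args, **kwargs):
    """Run the training; raises CalledProcessError on a non-zero exit."""
    subprocess.run(training_command(*args, **kwargs), check=True)
```

With the paths from this post, the arguments would be `r"C:\Transform\eng.test.lstm"`, `"C:\\Transform\\Result\\"`, `r"C:\Transform\TrainList.txt"` and `r"C:\Transform\tessdata_best-main\eng.traineddata"`. The output path is written with doubled backslashes because a Python raw string cannot end in a single backslash.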

18. Stop training and convert the checkpoint to traineddata

> lstmtraining --stop_training --traineddata=C:\Transform\tessdata_best-main\eng.traineddata --continue_from=C:\Transform\Result\_checkpoint --model_output=C:\Transform\eng.test.traineddata

> [command] [option] [traineddata path] [checkpoint path] [output path]
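The conversion step can be scripted in the same way. In the sketch below, both function names are hypothetical. `checkpoint_path` assumes the checkpoint file name is the `--model_output` value from step 17 with `_checkpoint` appended, which matches the C:\Transform\Result\_checkpoint path used in the command above.

```python
import subprocess


def checkpoint_path(model_output):
    """Assumed checkpoint location: the model_output basename + '_checkpoint'."""
    return f"{model_output}_checkpoint"


def finalize_command(traineddata, checkpoint, output_traineddata):
    """Build the lstmtraining call that turns a checkpoint into traineddata."""
    return [
        "lstmtraining",
        "--stop_training",
        f"--traineddata={traineddata}",
        f"--continue_from={checkpoint}",
        f"--model_output={output_traineddata}",
    ]


def finalize(traineddata, checkpoint, output_traineddata):
    """Run the conversion; raises CalledProcessError on a non-zero exit."""
    subprocess.run(
        finalize_command(traineddata, checkpoint, output_traineddata),
        check=True,
    )
```

For this post, `checkpoint_path("C:\\Transform\\Result\\")` gives the checkpoint argument, and the output would be `r"C:\Transform\eng.test.traineddata"`.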