Submitted by iacoposk8 t3_zvkgar in deeplearning
Hi everyone! I have a 300+ page ODT file, which I can easily convert to a TXT file. I also have several dozen scattered notes that I need to insert into this document. Is there any software, library, or GitHub project (preferably in Python) that can help me find the best, most coherent place to insert each note?
Alternatively, if you were to create the code from scratch, how would you go about it?
knight1511 t1_j1qf7dz wrote
Is the text in image format or can it be directly extracted in digital format?
If it is in digital format, you can extract the text directly with pdfminer, a Python package (pdfminer.six is the maintained fork).
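As a rough illustration of that extraction step, here is a minimal sketch. It assumes the pdfminer.six fork is installed (`pip install pdfminer.six`) and uses its `extract_text` helper. The `split_paragraphs` function and the blank-line splitting rule are my own assumptions about how you might chunk the text, not something pdfminer provides.

```python
import re


def extract_pdf_text(pdf_path):
    """Return the text of a text-based PDF. Requires pdfminer.six."""
    # Imported lazily so split_paragraphs() can be used without pdfminer installed.
    from pdfminer.high_level import extract_text
    return extract_text(pdf_path)


def split_paragraphs(text):
    """Hypothetical helper: split raw text on blank lines into clean paragraphs."""
    # Treat form feeds (page breaks), if present, as paragraph boundaries.
    text = text.replace("\x0c", "\n\n")
    chunks = re.split(r"\n\s*\n", text)
    # Collapse line wraps and repeated spaces inside each paragraph.
    paragraphs = [" ".join(chunk.split()) for chunk in chunks]
    return [p for p in paragraphs if p]
```

If you already have a TXT file, you can skip `extract_pdf_text` and pass the file contents straight to `split_paragraphs`.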
If the PDF has images inside it, you need to OCR the text first. ocrmypdf is a very handy Python package that uses Google's Tesseract OCR engine to convert images into digital characters. It is not a perfect process, but if the images are of good quality the result is almost perfect. Once you have the text in digital format, it can be indexed by a search engine.
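For the OCR step, a minimal sketch that drives the `ocrmypdf` command-line tool from Python. It assumes `ocrmypdf` and Tesseract are installed and on your PATH. The function names and the file names in the usage line are made up for illustration.

```python
import subprocess


def build_ocr_command(src_pdf, dst_pdf, language="eng"):
    """Build the ocrmypdf command line (hypothetical helper)."""
    # --skip-text leaves pages that already contain text untouched.
    # -l selects the Tesseract language pack.
    return ["ocrmypdf", "--skip-text", "-l", language, src_pdf, dst_pdf]


def run_ocr(src_pdf, dst_pdf, language="eng"):
    """Run ocrmypdf and raise CalledProcessError if it fails."""
    subprocess.run(build_ocr_command(src_pdf, dst_pdf, language), check=True)
```

Usage would look like `run_ocr("scanned.pdf", "searchable.pdf")`, after which the output PDF has a text layer you can extract.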
To search through the text, you can simply use a ready-made search engine such as Elasticsearch. You just need to supply the extracted text to the engine to be indexed. Then you can query it easily.
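To make the index-then-query idea concrete, here is a rough sketch that talks to Elasticsearch's REST API using only the standard library. Everything specific in it is an assumption: a node reachable at `http://localhost:9200` with security disabled, an index called `manuscript`, and one document per paragraph with `position` and `text` fields. Each note is sent as a `match` query, and the top hits are candidate places to insert it.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node, no authentication
INDEX = "manuscript"              # hypothetical index name


def _request(method, path, body):
    """Send a JSON request to Elasticsearch and return the parsed reply."""
    req = urllib.request.Request(
        ES_URL + path,
        data=json.dumps(body).encode("utf-8"),
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def index_paragraphs(paragraphs):
    """Store each paragraph as one document, keyed by its position."""
    for position, text in enumerate(paragraphs):
        _request("PUT", f"/{INDEX}/_doc/{position}",
                 {"position": position, "text": text})


def build_search_body(note, size=3):
    """Full-text match query that uses the whole note as the search text."""
    return {"size": size, "query": {"match": {"text": note}}}


def find_candidate_positions(note, size=3):
    """Return (paragraph position, relevance score) for the best matches."""
    result = _request("POST", f"/{INDEX}/_search", build_search_body(note, size))
    return [(hit["_source"]["position"], hit["_score"])
            for hit in result["hits"]["hits"]]
```

Keep in mind this ranks paragraphs by word overlap, so it only suggests where a note might belong. You would still review the top few hits by hand.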
One easy way to run Elasticsearch is via Docker. It's easy to get started with, provided you are already familiar with Docker.
Edit: Alternatively, you can explore Papermerge, which is free and open-source software.