home.social

#ocrmypdf — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #ocrmypdf, aggregated by home.social.

  1. Hey, it turns out that GNOME's "Document Scanner" application (Simple Scan) actually _can_ do Optical Character Recognition by running a post-processing script. It's just really, really, really not obvious (nor easy to set up): gitlab.gnome.org/GNOME/simple-

    As a stopgap, here's my proposed UI lipstick fix just so that the existing UI's purpose can be understood: gitlab.gnome.org/GNOME/simple-

    I'm hoping to see a built-in implementation someday.

    #SimpleScan #OCR #scanning #productivity #GNOME #UX #OCRmyPDF

  2. Old fuzzy pages, still tricky at 1600 dpi with #XSane and #OCRmyPDF on unix.
    =
    Hooper Ranch Bookkeeper ............... cobhebgeneaneen Lisa Salkov, Mana Diaz (alt.)
    =
    1. Mana should have been Maria.

    2. And a bunch of dots got hallucinated into random letters. Same as it ever was, back to encoded "Bacon wrote Shakespeare" gibberish.

    Otherwise, damned decent!

  7. @WorziArmin A colleague once presented tools for document management (#Dokumentenmanagement). But I fear that requires even more discipline. #OCRmyPDF can't solve the problem; it just scans things in and does the text recognition. For anyone who doesn't feel like sorting, I actually recommend #Recoll. Index the hard drive, and it finds almost everything. But the chaos on the hard drive would drive me crazy.

  8. ¯\_(ツ)_/¯ *meh
    The Homebrew pillow 12.0.0 upgrade breaks my PDF workflow :(
    But I can't downgrade to 11.3.0 because of dependencies
    And because Homebrew doesn't list the old version?

    Hmpf

    #homebrew #python #ocrmypdf

  9. @Martin Seeger Ah, naming really is a topic. And then again it isn't. My naming scheme for files is date-type-creator.

    However, I don't use #paperless; I do it by hand with #ocrmypdf. I sort the files into a directory structure. And thanks to OCR, #Recoll then finds everything again for me. @Bastian

  11. @D_J_Nathanson

    #pdftk for terminal
    @libreoffice draw
    #masterpdf v4 is free; current version is paid
    #ocrmypdf
    #pdfunite etc

    I can send you various aliases I have created. Also, see various pdf posts at linuxatty.wordpress.com.

  15. Editing or redacting a #PDF using #LibreOffice Draw is far superior to the commonly used method of converting the PDF's pages into images and editing the images, because the latter results in a PDF that is many times larger and doesn't render as well. Also, text copy and paste is lost, which you can recover from to some extent with a tool like #OCRmyPDF, but you'll never get the text quality back to as high as it was before you converted the PDF to images.
    #FOSS

  16. Have you ever needed to extract text from images embedded in a #PDF? I can highly recommend the open source tool #OCRmyPDF, which is easy to automate.

    It uses #Tesseract under the hood and has many options to experiment with to get the best possible accuracy for your language and PDF content.

    You can get started with just a few commands:

    samuelplumppu.se/blog/automate

  17. Adding an OCR layer and other PDF conversions

    When documents are scanned and saved as PDF, they are often stored as raster images. This is inconvenient because it makes full-text search of the content impossible. The OCRmyPDF utility solves this problem: with a single console command it adds an OCR layer with the recognized text to a PDF document. Several more useful tools for parsing PDFs are mentioned below, including ones for converting complex mathematical PDF documents to the Markdown text format.

    habr.com/ru/companies/globalsi

    #pdf #syntax #markitdown #конвертация (conversion) #ocrmypdf #ocr

  21. 2/2 re #OCR

    All three were set to #rotate and #deskew. None rotated the page that was sideways, but they all #deskewed pages that needed it. Kofax was the speediest of the bunch, then #OCRmyPDF not far behind and #Foxit was by far the slowest.

    File size: Foxit produced the smallest files, and #Kofax created files double the original size. OCRmyPDF struggled here, ballooning files to at least 6 times the original size.
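
The rotation and file-size results above both depend on command-line options, so here is a minimal sketch of how such a run could be assembled. The post does not say which options were used. `--rotate-pages`, `--deskew`, `--rotate-pages-threshold` and `--optimize` are OCRmyPDF options; the function name, the default values and the file names are assumptions made for illustration.

```python
def build_ocrmypdf_command(src, dst, optimize=2, rotate_threshold=None):
    """Return an argv list for an ocrmypdf run with rotation, deskew and optimization.

    optimize: OCRmyPDF optimization level, 0 (off) to 3 (most aggressive).
    rotate_threshold: optional confidence threshold for page rotation;
        a lower value makes OCRmyPDF more willing to rotate a page.
    """
    if optimize not in (0, 1, 2, 3):
        raise ValueError("optimize must be 0, 1, 2 or 3")
    cmd = ["ocrmypdf", "--rotate-pages", "--deskew", "--optimize", str(optimize)]
    if rotate_threshold is not None:
        cmd += ["--rotate-pages-threshold", str(rotate_threshold)]
    # Input and output paths go last (hypothetical file names in the usage note).
    cmd += [str(src), str(dst)]
    return cmd
```

Passing the result to `subprocess.run(build_ocrmypdf_command("in.pdf", "out.pdf", optimize=3, rotate_threshold=5))` would run it. Deskewing re-encodes the page images, which may be part of why output files grow; a higher optimization level or dropping `--deskew` may be worth trying if size matters more.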

  25. 1/2 re OCR

    I got to do a fun test at work with #OCR: #Foxit Phantom with ABBYY FineReader from 2013, #Kofax Power PDF from 2020, and #OCRmyPDF via #WSL with #Tesseract as the OCR engine.

    The best OCR results came from OCRmyPDF. Kofax was second, and lagging behind was the over-10-year-old Foxit. OCRmyPDF performed great and picked up a few more characters, especially in fuzzy scanned text, plus it got some handwritten text.

  26. Good grief, aren't we up to date again today:
    #PDF #OCR #CLI #Stapelverarbeitung (batch processing) ... as if it were 2005 ;)

    Then again: since documents are increasingly being "scanned" by phone, (after-the-fact) text recognition could be quite topical after all :)

    tutonaut.de/pdf-texterkennung-

    #Opensource #OCRmyPDF

  27. A digital archive with full-text search, including across PDFs and images

    Over the years, everyone accumulates a pile of paper documents that are hard to sort through or search. The problem is even more pressing for organizations. The open-source program Paperless-ngx is positioned as an optimal solution for building a digital archive. With built-in optical character recognition (OCR) and learning from previously scanned documents, it creates a searchable repository where any document can be found quickly. All documents are assigned tags, so they can appear in several thematic categories, which is more convenient than sorting into folders. Paperless-ngx can be installed on a home server, and documents can be uploaded through a browser from any device.

    habr.com/ru/companies/globalsi

    #Paperless #Paperlessngx #цифровое_хранилище (digital storage) #электронный_документооборот (electronic document management) #сканирование_документов (document scanning) #OCRmyPDF

  28. #WSL is nice because I can use #OCRmyPDF on #Ubuntu. I set it up to watch a folder for any new #PDF, then automatically #deskew, #rotate and #OCR it, then #export it to a "done" folder. It is very nice to have it done automatically in the background. No more opening a file, clicking to OCR, and waiting on the software while being unable to open other PDFs. Plus, this process is way lighter on resources. Man, I love #OpenSource.
    #Tesseract
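
A folder watcher like the one described could be sketched as a simple polling loop. The post does not say how the watching was actually set up, so this is one possible approach under stated assumptions: the function names, the `inbox` and `done` folder names and the polling interval are all hypothetical. `--deskew` and `--rotate-pages` are OCRmyPDF options.

```python
import subprocess
import time
from pathlib import Path


def pending_pdfs(inbox, done):
    """Return PDFs in inbox that have no same-named result in done yet."""
    return sorted(
        p for p in Path(inbox).glob("*.pdf") if not (Path(done) / p.name).exists()
    )


def process_once(inbox, done, run=subprocess.run):
    """Run ocrmypdf on every pending PDF; return the names that succeeded."""
    Path(done).mkdir(parents=True, exist_ok=True)
    finished = []
    for pdf in pending_pdfs(inbox, done):
        target = Path(done) / pdf.name
        cmd = ["ocrmypdf", "--deskew", "--rotate-pages", str(pdf), str(target)]
        result = run(cmd)
        if result.returncode == 0:
            finished.append(pdf.name)
    return finished


def watch(inbox, done, interval=10.0):
    """Poll the inbox forever, processing new PDFs every `interval` seconds."""
    while True:
        process_once(inbox, done)
        time.sleep(interval)
```

Calling `watch("inbox", "done")` would start the loop. A file that is still being written by the scanner could be picked up too early, and a PDF that keeps failing is retried on every poll, so a real setup would probably want a settle delay and a failure folder.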

  29. Lifehacker: This Free Tool Can Help You Search and Copy (Nearly) Any PDF. “There’s nothing worse than opening a PDF and realizing you can’t use the search function or even highlight text. This typically happens when a PDF was created by scanning a paper document — it’s just a series of images. Most modern scanning software uses Optical Character Recognition (OCR) so that words are both […]

    https://rbfirehose.com/2025/02/14/lifehacker-this-free-tool-can-help-you-search-and-copy-nearly-any-pdf/

  30. For my server backend, I used a #python script to handle the requests.

    Basically, it makes use of three components:

    #CV2 (open computer vision). This is a Swiss Army knife for image manipulation. I use it to reduce an image to a b/w format which only contains the text. Ideal for a quick copy

    #imagemagick to apply sigmoidal contrast, removing most problems due to poor illumination

    #OCRmyPDF to make it more convenient to work with the scanned sheet
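
To illustrate what a sigmoidal contrast curve does to a badly lit scan, here is a pure-Python sketch of the curve itself. It is not the post author's script, which calls ImageMagick. The formula is the rescaled sigmoid that ImageMagick documents for its `-sigmoidal-contrast` operator; the function names and the default values are assumptions.

```python
import math


def sigmoidal_contrast(u, contrast=5.0, midpoint=0.5):
    """Map a brightness u in [0, 1] through a sigmoid rescaled so 0 -> 0 and 1 -> 1.

    contrast must be greater than 0; larger values give a steeper curve.
    midpoint is the brightness around which the contrast is increased.
    """
    def sig(x):
        return 1.0 / (1.0 + math.exp(contrast * (midpoint - x)))

    low, high = sig(0.0), sig(1.0)
    return (sig(u) - low) / (high - low)


def apply_to_gray(pixels, contrast=5.0, midpoint=0.5):
    """Apply the curve to rows of 8-bit gray values (0..255)."""
    return [
        [round(255 * sigmoidal_contrast(p / 255, contrast, midpoint)) for p in row]
        for row in pixels
    ]
```

Values below the midpoint are pushed toward black and values above it toward white, while pure black and pure white stay fixed. That is why a grayish, unevenly lit page background tends to come out cleaner before thresholding and OCR.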

  35. @giantspacesquid

    I'm just guessing that #OCRmyPDF applied some compression options that the scanning software (Skanpage) didn't.

  36. @nielso may I preach to you about our Lord & Savior #PaperlessNGX (who also multiplies the wonders of #OCRmyPDF for the benefit of his disciples)? 😅

  37. The small joys of the free world: hacking together a script that, on the little Samba server (a Fujitsu thin client), picks up the scans dropped there by the Brother office monster, runs them through #ocrmypdf, and drops the result back on the Samba share.

  38. @nyx I just stumbled over this post and it got me thinking...

    If you're still on the lookout... perhaps #OCRmyPDF can help you.

    github.com/ocrmypdf/OCRmyPDF

    You would have to convert your documents to pdf and then throw them at OCRmyPDF to create searchable pdf files.

    If you want a nice #selfhosted web UI, #PaperlessNGX uses OCRmyPDF internally. You can run it in a #Docker #Container, upload your PDF files into it and, when it has done its thing, enjoy your searchable PDFs.

  39. Today I discovered #OCRmyPDF, a free tool for adding missing searchable text to #PDF files that contain images of text. It supports Linux, Windows, macOS, and FreeBSD. It works well, and it's easy to use. It's actively developed and maintained, so if you run into a problem and report it, there's a good chance it'll be fixed. If you've ever been frustrated by being unable to search a PDF file, this is the tool for you!
    Ref: github.com/ocrmypdf/OCRmyPDF/b
    #FOSS

  40. When you find a webpage that offers you a book but you can't download it, and you can't right-click to save the images of its pages, well – the page has loaded the images. Therefore the images are somewhere in your browser. What to do?

    Knowing a bit of how web pages are structured and built helps make the most of what you see online.

    1. In your browser, open the developer tools (push F12).

    2. Go to the "Network" tab and restrict the view to "Images" and "Media" (see the upper right side).

    3. Zoom into the book to ensure pages are of high resolution, then page through it.

    4. You will notice new rows appearing in the table of the "Network" tab of the Developer Tools.

    5. Now move your mouse over them and the image may even be shown to you; in any case just right-click and save it.

    There are scripts online to automate this, but if all you are after are a few pages, this suffices.

    To montage the pages into a PDF, use e.g.:

    $ img2pdf *jpg -o book.pdf

    ... and even OCR them if you like:

    $ ocrmypdf book.pdf book-OCR.pdf

    Both programs can be installed with:

    $ sudo apt-get install img2pdf ocrmypdf

    ... in ubuntu, debian, and the like.

    Or, import each into a page of a multi-page #Inkscape document and save it as a PDF.

    #img2pdf #ocrmypdf

  41. Wonderful #foss. Had to convert some paper exam papers into readable PDFs this weekend.

    #Ubuntu and the #gnome document scanner quickly got them scanned, and the superb #ocrmypdf then converted the images to text. No Adobe licenses here!!

  42. I am a rulebook hoarder. Whenever I take a closer look at a game downloading the rulebook is the first thing I do. I have over 2500 boardgame related pdf files. I access them using pdf-tools in #Emacs, index them using #recoll and I use a small hack to make M-x pdfgrep search using the recoll index. I use the #OCRmyPDF tool to OCR the ones that didn't come with embedded text.
    #boardgames
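
Picking out "the ones that didn't come with embedded text" can be sketched by extracting text and checking whether anything meaningful comes back. The post does not say how its author decides this. The sketch assumes the `pdftotext` command from Poppler is installed; the function names and the 20-character cutoff are arbitrary assumptions.

```python
import subprocess


def has_text(extracted, min_chars=20):
    """True if the extracted text holds at least min_chars non-space characters."""
    return sum(1 for ch in extracted if not ch.isspace()) >= min_chars


def pdftotext_extract(pdf_path):
    """Return the text layer of a PDF via Poppler's pdftotext ('-' means stdout)."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def needs_ocr(pdf_path, extract=pdftotext_extract):
    """True if the PDF appears to have no usable embedded text."""
    return not has_text(extract(pdf_path))
```

OCRmyPDF also has a `--skip-text` option that leaves pages with existing text alone, which may be simpler when every file is going to be run through it anyway.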