#ocrmypdf — Public Fediverse posts on home.social

Itan :fediverse: @[email protected] · 2026-04-26 · 16:19 UTC

Comparing PDF files (in Cápsula (e)Lucubración)

How to compare two PDF files

<gemini://itan.pollux.casa/elucubracion/comparando_dos_pdf.gmi>

#GemistaciónItan #PDF #comparar #GNU-Linux #DiffPDF #OCRmyPDF #CápsulaLucubración #geminiSpace #geminiProtocol #geminiEspacio

#gemistacionitan #pdf #comparar #gnu #diffpdf #ocrmypdf

Jeff Fortin T. (風の庭園のNekohayo) @[email protected] · 2026-03-20 · 19:37 UTC

Hey, it turns out that GNOME's "Document Scanner" application (Simple Scan) actually _can_ do Optical Character Recognition, running a post-processing script. It's just really, really, really not obvious (nor easy to set up): https://gitlab.gnome.org/GNOME/simple-scan/-/issues/1#note_2713733

As a stopgap, here's my proposed UI lipstick fix just so that the existing UI's purpose can be understood: https://gitlab.gnome.org/GNOME/simple-scan/-/merge_requests/322

I'm hoping to see a built-in implementation someday.

#SimpleScan #OCR #scanning #productivity #GNOME #UX #OCRmyPDF

#simplescan #ocr #scanning #productivity #gnome #ux

Jim Spath @[email protected] · 2026-02-05 · 22:23 UTC

Old fuzzy pages, still tricky at 1600 dpi with #XSane and #OCRmyPDF on unix.
=
Hooper Ranch Bookkeeper ............... cobhebgeneaneen Lisa Salkov, Mana Diaz (alt.)
=
1. Mana should have been Maria.

2. And a bunch of dots got hallucinated into random letters. Same as it ever was, back to encoded "Bacon wrote Shakespeare" gibberish.

Otherwise, damned decent!

#xsane #ocrmypdf

Liane M. Dubowy @[email protected] · 2026-01-06 · 10:29 UTC

@WorziArmin A colleague once presented tools for #Dokumentenmanagement (document management). But I fear that requires even more discipline. #OCRmyPDF can't solve the problem; it only scans and does the text recognition. For everyone who doesn't feel like sorting, I actually recommend #Recoll. Index the hard drive, and it finds almost everything. But the chaos on the hard drive would drive me crazy.

#dokumentenmanagement #ocrmypdf #recoll

eWe @[email protected] · 2025-11-20 · 11:20 UTC

¯\_(ツ)_/¯ *meh
The Homebrew pillow 12.0.0 upgrade breaks my PDF workflow :(
But I can't downgrade to 11.3.0 because of dependencies
And because Homebrew doesn't list the old version?

Hmpf

#homebrew #python #ocrmypdf

Tim Schlotfeldt ⚓🏳️‍🌈 @[email protected] · 2025-10-28 · 10:18 UTC

@Martin Seeger Ah, naming really is a topic. And then again it isn't. My naming scheme for files is date-type-creator.

However, I don't use #paperless; I do it by hand with #ocrmypdf. I sort the files into a directory structure. And thanks to OCR, #Recoll then finds everything again for me. @Bastian

#paperless #ocrmypdf #recoll

Victor Forberger @vforberger · 2025-10-11 · 19:27 UTC

@D_J_Nathanson

#pdftk for terminal
@libreoffice draw
#masterpdf v4 is free; current version is paid
#ocrmypdf
#pdfunite etc

I can send you various aliases I have created. Also, see various pdf posts at linuxatty.wordpress.com.

#pdftk #masterpdf #ocrmypdf #pdfunite

Jonathan Kamens 86 47 @[email protected] · 2025-09-09 · 08:31 UTC

Editing or redacting a #PDF using #LibreOffice Draw is far superior to the commonly used method of converting the PDF's pages into images and editing the images, because the latter results in a PDF that is many times larger and doesn't render as well. Also, text copy and paste is lost, which you can recover from to some extent with a tool like #OCRmyPDF, but you'll never get the text quality back to as high as it was before you converted the PDF to images.
#FOSS

#pdf #libreoffice #ocrmypdf #foss

Samuel Plumppu @Greenheart · 2025-09-05 · 10:00 UTC

Have you ever needed to extract text from images embedded in a #PDF? I can highly recommend the open source #CLI tool #OCRmyPDF, which is easy to automate, for example in a #DataPipeline.

It uses #Tesseract #OCR under the hood and has many options to experiment with to get the best possible accuracy for your language and PDF content.

You can get started with just a few commands:

https://samuelplumppu.se/blog/automated-text-extraction-from-pdf-images-with-ocrmypdf

#pdf #cli #ocrmypdf #datapipeline #tesseract #ocr
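
As a rough illustration of the pipeline idea in the post above, here is a minimal Python sketch that shells out to the `ocrmypdf` command. It assumes `ocrmypdf` is on the PATH and that the matching Tesseract language packs are installed. The function names and the example language codes are hypothetical and do not come from the linked article.

```python
import subprocess
from pathlib import Path


def build_ocr_command(src: Path, dst: Path, languages: list[str]) -> list[str]:
    """Build an ocrmypdf command line for one PDF.

    Tesseract language codes are joined with "+" for the -l option,
    e.g. ["eng", "swe"] becomes "eng+swe".
    """
    return ["ocrmypdf", "-l", "+".join(languages), str(src), str(dst)]


def ocr_pdf(src: Path, dst: Path, languages: list[str]) -> bool:
    """Run OCR on one PDF; return True if ocrmypdf exited successfully."""
    result = subprocess.run(build_ocr_command(src, dst, languages))
    return result.returncode == 0
```

A pipeline step could then call `ocr_pdf(Path("scan.pdf"), Path("scan-ocr.pdf"), ["eng"])` and branch on the returned flag.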

Habr @[email protected] · 2025-08-24 · 18:22 UTC

Adding an OCR layer and other PDF transformations

When documents are scanned and saved as PDF, they are often stored as graphic images. This is inconvenient because it makes full-text search of the content impossible. The OCRmyPDF utility solves this problem: with a single console command, it adds an OCR layer with the recognized text to a PDF document. Below, several more useful tools for parsing PDFs are mentioned, including ones for converting complex mathematical PDF documents to the Markdown text format.

https://habr.com/ru/companies/globalsign/articles/940286/

#pdf #syntax #markitdown #конвертация #ocrmypdf #ocr

#ocr #ocrmypdf #конвертация #markitdown #syntax #pdf

Dustin @[email protected] · 2025-03-20 · 15:42 UTC

2/2 re #OCR

All three were set to #rotate and #deskew. None rotated the page that was sideways, but they all #deskewed pages that needed it. Kofax was the speediest of the bunch, then #OCRmyPDF not far behind and #Foxit was by far the slowest.

On file size, Foxit produced the smallest files and #Kofax created files double the size of the original. OCRmyPDF struggled here, ballooning the files to at least 6 times the original size.

#ocr #rotate #deskew #deskewed #ocrmypdf #foxit

Dustin @[email protected] · 2025-03-20 · 15:42 UTC

1/2 re OCR

I got to do a fun test at work with #OCR: #Foxit Phantom with ABBYY FineReader from 2013, #Kofax Power PDF from 2020, and #OCRmyPDF via #WSL with #Tesseract as the OCR engine.

The best OCR results came from OCRmyPDF. Kofax was second, and the over 10-year-old Foxit lagged behind. OCRmyPDF performed great: it picked up a few more characters, especially in fuzzy scanned text, and it even got some handwritten text.

#ocr #foxit #kofax #ocrmypdf #wsl #tesseract

El Tuto @[email protected] · 2025-03-06 · 13:17 UTC

Good grief, aren't we up to date again today:
#PDF #OCR #CLI #Stapelverarbeitung (batch processing) ... as if it were 2005 ;)

Then again: now that documents are increasingly "scanned" with phones, (after-the-fact) text recognition might be quite timely after all :)

https://www.tutonaut.de/pdf-texterkennung-stapelweise-fuer-windows-und-linux/

#Opensource #OCRmyPDF

#pdf #ocr #cli #stapelverarbeitung #opensource #ocrmypdf

Habr @[email protected] · 2025-03-02 · 13:02 UTC

A digital archive with full-text search, including across PDFs and images

Over the years, everyone accumulates a pile of paper documents that are hard to sort through or search. The problem is even more pressing for organizations. The open-source program Paperless-ngx is positioned as the optimal solution for creating a digital archive. With built-in optical character recognition (OCR) and learning based on previously scanned documents, it creates a searchable repository where any document can be found quickly. All documents are assigned tags, so they can appear in several thematic categories, which is more convenient than sorting into folders. Paperless-ngx can be installed on a home server, and documents can be uploaded through a browser from any device.

https://habr.com/ru/companies/globalsign/articles/887176/

#Paperless #Paperlessngx #цифровое_хранилище #электронный_документооборот #сканирование_документов #OCRmyPDF

#ocrmypdf #сканирование_документов #электронный_документооборот #цифровое_хранилище #paperlessngx #paperless

Dustin @[email protected] · 2025-02-20 · 14:27 UTC

#WSL is nice because I can use #OCRmyPDF on #Ubuntu. I set it up to watch a folder for any new #PDF then automatically #deskew #rotate #OCR then #export to a "done" folder. It is very nice to have it done automatically in the background. No more opening, clicking to OCR and waiting on the software and unable to open other PDFs. Plus, this process is way lighter on resources. Man, I love #OpenSource.
#Tesseract

#wsl #ocrmypdf #ubuntu #pdf #deskew #rotate
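
The watched-folder setup described above is not shown in the post, so the following is only a sketch of one way it could work: a simple polling loop in Python. `--deskew` and `--rotate-pages` are ocrmypdf options. The folder layout, function names, and polling interval are assumptions made for illustration.

```python
import shutil
import subprocess
import time
from pathlib import Path


def build_command(src: Path, dst: Path) -> list[str]:
    """ocrmypdf call that deskews, auto-rotates and OCRs one PDF."""
    return ["ocrmypdf", "--deskew", "--rotate-pages", str(src), str(dst)]


def pending_pdfs(inbox: Path) -> list[Path]:
    """PDFs currently waiting in the inbox, in a stable order."""
    return sorted(p for p in inbox.iterdir() if p.suffix.lower() == ".pdf")


def process_once(inbox: Path, done: Path, originals: Path) -> None:
    """OCR each waiting PDF into `done`, then move the source out of the inbox."""
    for src in pending_pdfs(inbox):
        result = subprocess.run(build_command(src, done / src.name))
        if result.returncode == 0:
            # Only successful files leave the inbox; failures are retried
            # on the next pass (hypothetical policy, adjust as needed).
            shutil.move(str(src), str(originals / src.name))


def watch(inbox: Path, done: Path, originals: Path, interval: float = 10.0) -> None:
    """Poll the inbox forever; stop with Ctrl+C."""
    while True:
        process_once(inbox, done, originals)
        time.sleep(interval)
```

Polling keeps the sketch dependency-free, but it can pick up a file that is still being copied into the inbox, so a real setup would probably wait until a file's size stops changing before processing it.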

ResearchBuzz: Firehose @[email protected] · 2025-02-14 · 10:58 UTC

Lifehacker: This Free Tool Can Help You Search and Copy (Nearly) Any PDF. “There’s nothing worse than opening a PDF and realizing you can’t use the search function or even highlight text. This typically happens when a PDF was created by scanning a paper document — it’s just a series of images. Most modern scanning software uses Optical Character Recognition (OCR) so that words are both […]

https://rbfirehose.com/2025/02/14/lifehacker-this-free-tool-can-help-you-search-and-copy-nearly-any-pdf/

#ocr #ocrmypdf #opensource #pdf #pdfediting

stirz ✅ @[email protected] · 2024-09-08 · 15:50 UTC

For my server backend, I used a #python script to handle the requests.

Basically, it makes use of three components:

#CV2 (open computer vision). This is a Swiss Army knife for image manipulation. I use it to reduce an image to a b/w format which only contains the text. Ideal for a quick copy

#imagemagick to apply sigmoidal contrast, removing most problems due to poor illumination

#OCRmyPDF to make it more convenient to work with the scanned sheet

#python #cv2 #imagemagick #ocrmypdf
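
The post does not include the script itself, so what follows is a guess at how those pieces could fit together, not the author's code. It assumes OpenCV's Python bindings, ImageMagick 7 (the `magick` command), and `ocrmypdf` are installed. The order of the steps, the threshold parameters, the contrast values, and the temporary file names are all illustrative assumptions.

```python
import subprocess


def binarize(src: str, dst: str) -> None:
    """Reduce a photo to black-and-white text with OpenCV adaptive thresholding."""
    import cv2  # imported here so the command builders work without OpenCV

    gray = cv2.imread(src, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(src)
    # Block size 31 and offset 15 are illustrative values, not the author's.
    bw = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 15
    )
    cv2.imwrite(dst, bw)


def contrast_command(src: str, dst: str) -> list[str]:
    """ImageMagick 7 call applying sigmoidal contrast (strength 5, midpoint 50%)."""
    return ["magick", src, "-sigmoidal-contrast", "5x50%", dst]


def ocr_command(image: str, pdf: str, dpi: int = 300) -> list[str]:
    """ocrmypdf call turning an image into a searchable PDF."""
    return ["ocrmypdf", "--image-dpi", str(dpi), image, pdf]


def scan_to_pdf(photo: str, pdf: str) -> None:
    """Even out lighting, binarize, then OCR (assumed order of steps)."""
    subprocess.run(contrast_command(photo, "contrast.png"), check=True)
    binarize("contrast.png", "bw.png")
    subprocess.run(ocr_command("bw.png", pdf), check=True)
```

`--image-dpi` matters because photos often carry no usable resolution metadata, and ocrmypdf needs a DPI value to size the page when the input is an image.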

R. L. Dane :debian: :openbsd: @RL_Dane · 2024-07-24 · 18:39 UTC

@giantspacesquid

I'm just guessing that #ocrmypdf applied some compression options that the scanning software (#KDE Skanpage) didn't.

#ocrmypdf #kde

Elias Probst @[email protected] · 2024-07-03 · 12:43 UTC

@nielso may I preach to you about our Lord & Savior #PaperlessNGX (who also multiplies the wonders of #OCRmyPDF for the benefit of his disciples)? 😅

#paperlessngx #ocrmypdf

Nielso @[email protected] · 2024-07-03 · 12:06 UTC

The small joys of the free world: hacking together a script on the little Samba server (a Fujitsu thin client) that picks up the scans dropped there by the Brother office monster, runs them through #ocrmypdf, and drops the result back onto the Samba share.

#ocrmypdf

buja @[email protected] · 2024-06-16 · 17:48 UTC

@nyx I just stumbled over this post and it got me thinking...

If you're still on the lookout... perhaps #OCRmyPDF can help you.

https://github.com/ocrmypdf/OCRmyPDF

You would have to convert your documents to pdf and then throw them at OCRmyPDF to create searchable pdf files.

If you want a nice #selfhosted web UI, #PaperlessNGX uses OCRmyPDF internally. You can run it in a #Docker #Container, upload your pdf files into it, and when it has done its thing, enjoy your searchable pdfs.

#ocrmypdf #selfhosted #paperlessngx #docker #container

Elias Probst @[email protected] · 2024-02-25 · 17:32 UTC

@lauren for going down this route, you might want to give #OCRmyPDF a try:
https://ocrmypdf.readthedocs.io/en/latest/cookbook.html#produce-pdf-and-text-file-containing-ocr-text

@alastair @jamesbritt

#ocrmypdf
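
The cookbook entry linked above is about getting both a searchable PDF and a plain-text file from one run, which ocrmypdf supports through its `--sidecar` option. A small Python wrapper might look like the sketch below. The output naming scheme (`-ocr.pdf`, `.txt`) is a made-up convention for the example.

```python
import subprocess
from pathlib import Path


def sidecar_command(src: Path, dst: Path, text: Path) -> list[str]:
    """ocrmypdf call that writes the recognized text to a separate file."""
    return ["ocrmypdf", "--sidecar", str(text), str(src), str(dst)]


def ocr_with_text(src: Path) -> tuple[Path, Path]:
    """OCR `src`; return the paths of the searchable PDF and the text file."""
    dst = src.with_name(src.stem + "-ocr.pdf")  # hypothetical naming scheme
    text = src.with_suffix(".txt")
    subprocess.run(sidecar_command(src, dst, text), check=True)
    return dst, text
```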

Jonathan Kamens 86 47 @[email protected] · 2024-02-04 · 01:44 UTC

Today I discovered #OCRmyPDF, a free tool for adding missing searchable text to #PDF files that contain images of text. It supports Linux, Windows, macOS, and FreeBSD. It works well, and it's easy to use. It's actively developed and maintained, so if you run into a problem and report it, there's a good chance it'll be fixed. If you've ever been frustrated by being unable to search a PDF file, this is the tool for you!
Ref: https://github.com/ocrmypdf/OCRmyPDF/blob/main/README.md
#FOSS

#ocrmypdf #pdf #foss

podfeet @[email protected] · 2024-01-01 · 01:41 UTC

NC #973 OCR PDFs for Free with Shortcuts (and Automator), Story of Dark Patterns with Bart Busschots https://www.podfeet.com/blog/2023/12/nc-973/

#OCRMyPDF #OCR #AppleShortcuts #DarkPatterns

#ocrmypdf #ocr #appleshortcuts #darkpatterns

podfeet @[email protected] · 2023-12-31 · 23:54 UTC

OCR PDFs using Free Open Source Tools with Apple Shortcuts (or Automator) https://www.podfeet.com/blog/2023/12/ocr-pdf-quick-action/

#AppleShortcuts #Apple #OpenSource #OCR #OCRMyPDF

#appleshortcuts #apple #opensource #ocr #ocrmypdf

podfeet @[email protected] · 2023-12-12 · 18:45 UTC

OCR PDFs using Free Open Source Tools with a Shell Script and Keyboard Maestro https://www.podfeet.com/blog/2023/12/ocr-pdf-shell-script-keyboard-maestro/

#OCRMyPDF #ShellScript #Bash #Programming

#ocrmypdf #shellscript #bash #programming

Albert Cardona @[email protected] · 2023-12-12 · 10:27 UTC

When you find a webpage that offers you a book but you can't download it, and you can't right-click to save the images of its pages, well – the page has loaded the images. Therefore the images are somewhere in your browser. What to do?

Knowing a bit of how web pages are structured and built helps make the most of what you see online.

1. In your browser, open the developer tools (push F12).

2. Go to the "Network" tab and restrict the view to "Images" and "Media" (see the upper right side).

3. Zoom into the book to ensure pages are of high resolution, then page through the book.

4. You will notice new rows appearing in the table of the "Network" tab of the Developer Tools.

5. Now move your mouse over them and the image may even be shown to you; in any case just right-click and save it.

There are scripts online to automate this, but if all you are after are a few pages, this suffices.

To montage the pages into a PDF, use e.g.:

$ img2pdf *jpg -o book.pdf

... and even OCR them if you like:

$ ocrmypdf book.pdf book-OCR.pdf

Both programs can be installed with:

$ sudo apt install img2pdf ocrmypdf

... in ubuntu, debian, and the like.

Or, import each into a page of a multi-page #Inkscape document and save it as a PDF.

#img2pdf #ocrmypdf

#ocrmypdf #img2pdf #inkscape
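
One catch with `img2pdf *jpg` is that the shell expands the glob in plain lexical order, so `page10.jpg` lands before `page2.jpg` unless the file names are zero-padded. A sketch of the same two commands driven from Python, with a natural sort on the file names, could look like this. The function names and the `book` output stem are assumptions for the example.

```python
import re
import subprocess
from pathlib import Path


def natural_key(path: Path) -> list:
    """Sort key that orders page2.jpg before page10.jpg."""
    parts = re.split(r"(\d+)", path.name)
    # With a capturing group, odd positions hold the digit runs.
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


def build_commands(pages: list[Path], stem: str = "book") -> list[list[str]]:
    """img2pdf and ocrmypdf calls for the given page images."""
    ordered = [str(p) for p in sorted(pages, key=natural_key)]
    return [
        ["img2pdf", *ordered, "-o", f"{stem}.pdf"],
        ["ocrmypdf", f"{stem}.pdf", f"{stem}-OCR.pdf"],
    ]


def make_book(folder: Path) -> None:
    """Assemble every JPEG in `folder` into book.pdf, then OCR it."""
    for command in build_commands(list(folder.glob("*.jpg"))):
        subprocess.run(command, check=True)
```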

Neal PP @[email protected] · 2023-12-09 · 11:34 UTC

Wonderful #foss. Had to convert some paper exam papers into readable PDFs this weekend.

#Ubuntu and the #gnome document scanner quickly got them scanned, and then the superb #ocrmypdf converted the images to text. No Adobe licenses here!!

#foss #ubuntu #gnome #ocrmypdf

zrzz @[email protected] · 2023-11-18 · 06:44 UTC

I am a rulebook hoarder. Whenever I take a closer look at a game downloading the rulebook is the first thing I do. I have over 2500 boardgame related pdf files. I access them using pdf-tools in #Emacs, index them using #recoll and I use a small hack to make M-x pdfgrep search using the recoll index. I use the #OCRmyPDF tool to OCR the ones that didn't come with embedded text.
#boardgames

#emacs #recoll #ocrmypdf #boardgames
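
For a large collection where only some files lack embedded text, ocrmypdf's `--skip-text` option leaves pages that already contain text alone, so the same command can be run over everything. The sketch below is not the poster's setup. It assumes the output directory lies outside the scanned tree and that file names are unique across subfolders.

```python
import subprocess
from pathlib import Path


def skip_text_command(src: Path, dst: Path) -> list[str]:
    """ocrmypdf call that leaves pages with existing text untouched."""
    return ["ocrmypdf", "--skip-text", str(src), str(dst)]


def ocr_collection(root: Path, out_dir: Path) -> list[Path]:
    """OCR every PDF under `root` into `out_dir`; return the files that failed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for src in sorted(root.rglob("*.pdf")):
        result = subprocess.run(skip_text_command(src, out_dir / src.name))
        if result.returncode != 0:
            failed.append(src)
    return failed
```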