#pagexml — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #pagexml, aggregated by home.social.

Benjamin Rosemann @[email protected] · 2024-07-06 · 06:49 UTC

@tkinias As far as I understand, you want to implement a PDF -> Text -> PDF workflow. Using plain text as the intermediate format is problematic, as you may lose a lot of layout information.
For high-quality full text you may need a more sophisticated intermediate format like #PageXML or #AltoXML. But these also require a more sophisticated editing tool, such as #OCR4All.
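
To illustrate the point about layout information, here is a minimal sketch in Python that reads text lines and their polygon coordinates from a PAGE XML document. The sample document is hand-written for illustration and does not come from a real scan; the helper names `local_name` and `read_text_lines` are hypothetical. Elements are matched on their local name, on the assumption that the file follows the usual `TextLine` / `Coords` / `TextEquiv` / `Unicode` structure, so the sketch does not depend on one particular PAGE schema version.

```python
import xml.etree.ElementTree as ET

# Hand-written sample for illustration only (assumed structure, not real OCR output).
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page1.png" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="100,100 900,100 900,300 100,300"/>
      <TextLine id="l1">
        <Coords points="100,100 900,100 900,180 100,180"/>
        <TextEquiv><Unicode>First line</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="l2">
        <Coords points="100,200 700,200 700,280 100,280"/>
        <TextEquiv><Unicode>Second line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def read_text_lines(xml_text):
    """Return a list of (line id, text, polygon) tuples, one per TextLine element."""
    root = ET.fromstring(xml_text)
    result = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        text, polygon = None, None
        for child in element:
            name = local_name(child.tag)
            if name == "Coords":
                # 'points' is a space-separated list of "x,y" pairs.
                polygon = [
                    tuple(int(value) for value in point.split(","))
                    for point in child.get("points", "").split()
                ]
            elif name == "TextEquiv":
                for grandchild in child:
                    if local_name(grandchild.tag) == "Unicode":
                        text = grandchild.text or ""
        result.append((element.get("id"), text, polygon))
    return result


lines = read_text_lines(SAMPLE)

# The plain-text view keeps only the words; the polygons are gone.
plain_text = "\n".join(text for _, text, _ in lines)

for line_id, text, polygon in lines:
    print(line_id, repr(text), polygon)
```

Once the result is reduced to `plain_text`, nothing says where each line sat on the page, which is the information a Text -> PDF step would need to place the text again.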

#pagexml #altoxml #ocr4all

Stefan Weil @[email protected] · 2024-06-06 · 13:55 UTC

Just in time for #BiblioCON24, the new release 5.4.0 of #TesseractOCR is out, our standard solution for automated text recognition, used in #Zeitungsdigitalisierung (newspaper digitization) and beyond. Tesseract can now also produce #PAGEXML and generates nicer PDF files.

#bibliocon24 #tesseractocr #zeitungsdigitalisierung #pagexml

Janne @[email protected] · 2023-03-04 · 16:47 UTC

@einerseits Interesting project! I'm experimenting with #eScriptorium and #kraken #OCR for recognition of German Kurrent. Can you possibly shed light on your transcription process? And do you provide the OCR full text files in #ALTO or #PageXML as well or plan on doing so in the future?

#escriptorium #kraken #ocr #alto #pagexml