home.social

#pagexml — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #pagexml, aggregated by home.social.

  1. @tkinias As far as I understand, you want to implement a PDF -> Text -> PDF workflow. Using plain text as the intermediate format is problematic, as you may lose a lot of layout information.

    For high-quality full text, you may need a more sophisticated intermediate format like #PageXML or #AltoXML. But those also require a more sophisticated editing tool, like #OCR4All.

  2. Just in time for #BiblioCON24, there is the new release 5.4.0 of #TesseractOCR, our standard solution for automated text recognition (not only) in #Zeitungsdigitalisierung (newspaper digitization). Tesseract can now also produce #PAGEXML and generates nicer PDF files.

  3. @einerseits Interesting project! I'm experimenting with #eScriptorium and #kraken #OCR for recognition of German Kurrent. Could you shed some light on your transcription process? And do you provide the OCR full-text files in #ALTO or #PageXML as well, or do you plan to do so in the future?