home.social

#tika — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #tika, aggregated by home.social.

  1. When an older woman paints a #Tika on your forehead at Dashain, it is a wonderful and respectful gesture full of love, protection and blessing.

    It is a moment that says: “You are part of our family; we protect and support you.”

  2. #PDF digital archaeology: in a more or less broken PDF of an academic paper I found a lot of unreferenced objects. Some of them contain understandable and meaningful text, probably from the authors, that is neither displayed by Adobe Reader nor extracted by #Tika. Exciting, isn't it?

    (Oh, and a reminder to users of #pdfTeX: apparently, when you insert an image into a PDF, the absolute path and filename of the image are written into the PDF...)

  3. I am still extremely impressed by what Tika in OpenCloud can do. I have lots of pictures of steam engines, and from many close-up photos Tika picks up the technical data noted on the locomotive.

  4. By the way, I am starting to have quite a few little cheat sheets on using digital preservation tools, or general-purpose tools that can be used in a #digipres context (#Tika, #7Zip, #Robocopy, #exiftool ...). If there is demand, I can make the effort to publish them!

  5. Hi @tallison ! I have used #Tika to extract the text from a ~170 Tb set of files, in batch mode, from the CLI.

    I have two questions:
    1) The result is ~17K text files while I was expecting ~23K files. Are there any formats that are just ignored by Tika (AFAIK, there were no exotic formats in this corpus)?
    2) In batch mode, is Tika applying OCR? It did not throw a warning like when used in regular mode.
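
One way to narrow down question 1 is to list the inputs that have no matching output and tally them by extension. The following is a minimal sketch, assuming the output tree mirrors the input tree and each output name is the input name plus a suffix such as `.txt`. That naming convention, the function names `missing_outputs` and `count_by_extension`, and the directory arguments are assumptions for illustration, not something the post confirms.

```python
from collections import Counter
from pathlib import Path


def missing_outputs(input_dir, output_dir, suffix=".txt"):
    """Return input files (as relative paths) that have no extracted counterpart.

    Assumption: the output for "a/b.pdf" is written to "a/b.pdf" + suffix.
    Adjust the mapping if your run names its outputs differently.
    """
    input_root, output_root = Path(input_dir), Path(output_dir)
    missing = []
    for path in sorted(input_root.rglob("*")):
        if not path.is_file():
            continue  # skip directories
        rel = path.relative_to(input_root)
        expected = output_root / (str(rel) + suffix)
        if not expected.is_file():
            missing.append(rel)
    return missing


def count_by_extension(paths):
    """Tally paths by lowercased file extension ("" when there is none)."""
    return Counter(p.suffix.lower() for p in paths)
```

Calling `count_by_extension(missing_outputs("corpus", "extracted"))` (both directory names are placeholders) would show whether the gap is concentrated in a few formats or spread across all of them.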

  6. When your #ownCloud #OCIS takes up more memory than the Apache #Tika #Java process.

  7. Just found out there's now a development prototype of veraPDF-rest, which exposes #VeraPDF's functionality through a REST API:

    github.com/veraPDF/veraPDF-res

    Will need to try this out, but this definitely looks really useful!

    This could also be good for developing performant VeraPDF wrappers in other programming languages, like Python (similar to how Tika-python currently wraps around #Apache #Tika's REST API).

  8. #vectordatabases are amazing. Created a Q&A #ai running completely local. And it is shockingly good whilst being shockingly easy to implement...

    I just dump a folder of PDFs into #apache #tika. Concat them, split them by \n\n to get paragraphs. Yoink them into #chromadb. Done.

    Now I can pose a question that will query chroma to return 20 semantically similar documents. Those documents are dumped into a mixtral-instruct in combination with the original question.
    The results are nearly perfect!
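
The split and prompt-assembly steps described in this post can be sketched in plain Python. This is only an illustration under assumptions: text extraction, the vector store query and the model call are left out and stood in for by ordinary strings and lists, the function names are hypothetical, and the prompt wording is invented here. The split also tolerates whitespace-only lines between paragraphs, which is slightly more lenient than splitting on a literal `\n\n`.

```python
import re


def split_paragraphs(text):
    """Split extracted text on blank lines and drop empty fragments."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]


def build_prompt(question, passages):
    """Combine retrieved passages and the original question into one prompt.

    The instruction sentence is an assumption; the post does not say
    how the prompt was worded.
    """
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the context below.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )
```

In a full pipeline, the list returned by `split_paragraphs` would be what gets stored in the vector database, and `passages` would be the paragraphs it returns for the question.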

  9. @_DigitalWriter_ I am really looking forward to that! A big thank-you in advance, Herbert! 😊 🙏

    Will #tika and #gotenberg happen to be covered too, which apparently lets even #EML be processed alongside #office files? 😮
    Would that finally give me a cross-platform solution for #email #archivierung (email archiving) "on the side"? 😍

    docs.paperless-ngx.com/configu

    I still find that a bit complicated, and it would really help me move #paperlessngx up my project agenda! 🥳

    #digitalcleaning

  10. New blog post, in which I review and test some options for extracting unformatted text from #EPUB files in Python, using #Apache #Tika (via #Tika-python), #Textract and #EbookLib.

    Includes link to Git repo with demo scripts.

    bitsgalore.org/2023/03/09/extr

  11. Yesterday I re-ran a script I made a few weeks ago that extracts text from some EPUB files with #Apache #Tika. To my surprise the output was slightly different from the original run (adding some garbage text to the end of the file), even though I was using the same Tika version!

    Turns out this is because by default, Tika OCRs any embedded images if it finds a Tesseract installation. Apparently I installed this as part of some other software after I first ran the script.
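
A quick way to catch this kind of environment drift is to record, alongside each extraction run, whether a Tesseract executable is visible. The sketch below assumes Tika finds Tesseract through the PATH rather than through an explicitly configured location. The function names are hypothetical, and how to switch OCR off depends on the Tika version and configuration, so that part is left to its documentation.

```python
import shutil


def tesseract_on_path(which=shutil.which):
    """Return True if a `tesseract` executable is found on PATH.

    `which` is injectable so the check can be exercised without
    depending on what is installed on the current machine.
    """
    return which("tesseract") is not None


def ocr_note(which=shutil.which):
    """Return a one-line note suitable for logging next to the run's output."""
    if tesseract_on_path(which):
        return "tesseract found on PATH: embedded images may be OCRed"
    return "tesseract not found on PATH: no OCR text expected"
```

Logging `ocr_note()` with each run makes it easier to explain later why two runs with the same Tika version produced different text.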