#tika — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #tika, aggregated by home.social.
-
Security updates: #Apache HTTP Server and #Tika are vulnerable | Security https://www.heise.de/news/Sicherheitsupdates-Apache-HTTP-Server-und-Tika-sind-verwundbar-11105834.html #Patchday
-
When an older woman paints a #Tika on your forehead at Dashain, it is a wonderful and respectful gesture full of love, protection, and blessing.
It is a moment that shows: "You are part of our family; we protect and support you."
-
#PDF digital archaeology: in a more or less broken PDF of an academic paper I found a lot of unreferenced objects. Some of them contain understandable and meaningful text, probably from the authors, that is neither displayed by Adobe Reader nor extracted by #Tika. Exciting, isn't it?
(Oh, and a reminder for users of #pdfTeX: apparently, when you insert an image into a PDF, the absolute path and filename of the image are written into the PDF...)
-
I'm still blown away by what Tika in OpenCloud can do. I have lots of pictures of steam engines, and from many of the close-up photos Tika picks up the technical data written on the loco. #opencloud #tika
-
Hi @tallison! I have used #Tika to extract the text from a ~170 Tb set of files, in batch mode, from the CLI.
I have two questions:
1) The result is ~17K text files while I was expecting ~23K files. Are there any formats that are just ignored by Tika (AFAIK, there were no exotic formats in this corpus)?
2) In batch mode, is Tika applying OCR? It did not throw a warning like when used in regular mode.
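One way to narrow down the first question is to list which input files have no matching output and then tally them by extension. The sketch below is only an illustration: the function names are hypothetical, and it assumes the batch run mirrors the input tree and names each output after the input's relative path plus a suffix such as `.txt`, which should be checked against the actual output directory.

```python
from collections import Counter
from pathlib import Path


def missing_outputs(input_dir, output_dir, suffix=".txt"):
    """Return relative paths of input files that have no extracted output.

    Assumption: each output is named <relative input path> + suffix.
    """
    input_root, output_root = Path(input_dir), Path(output_dir)
    missing = []
    for src in sorted(input_root.rglob("*")):
        if not src.is_file():
            continue  # skip directories
        rel = src.relative_to(input_root)
        expected = output_root / (str(rel) + suffix)
        if not expected.is_file():
            missing.append(rel)
    return missing


def count_by_extension(paths):
    """Tally paths by lower-cased file extension to spot skipped formats."""
    return Counter(p.suffix.lower() or "(none)" for p in paths)
```

If most of the missing files share one or two extensions, that points at a format-specific cause rather than random failures.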
-
Just found out there's now a development prototype of veraPDF-rest, which exposes #VeraPDF's functionality through a REST API:
https://github.com/veraPDF/veraPDF-rest
Will need to try this out, but this definitely looks really useful!
This could also be good for developing performant VeraPDF wrappers in other programming languages, like Python (similar to how Tika-python currently wraps around #Apache #Tika's REST API).
-
#vectordatabases are amazing. Created a Q&A #ai that runs completely locally. And it is shockingly good whilst being shockingly easy to implement...
I just dump a folder of PDFs into #apache #tika. Concatenate them, split them on \n\n to get paragraphs. Yoink them into #chromadb. Done.
Now I can pose a question that queries Chroma to return the 20 most semantically similar paragraphs. Those paragraphs are passed to a mixtral-instruct model together with the original question.
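The chunking step described above can be sketched in plain Python. This is a minimal illustration, not the poster's actual script: the function names are hypothetical, the input is assumed to be a list of text strings already extracted by Tika, and the vector store and model calls are left out.

```python
import re


def split_paragraphs(text: str, min_chars: int = 1) -> list[str]:
    """Split extracted text into paragraphs on blank lines."""
    # Normalise Windows and old Mac line endings to "\n" first.
    normalised = text.replace("\r\n", "\n").replace("\r", "\n")
    # A blank line (optionally containing whitespace) separates paragraphs.
    chunks = re.split(r"\n\s*\n", normalised)
    # Collapse internal line breaks and repeated spaces inside each chunk.
    paragraphs = [" ".join(chunk.split()) for chunk in chunks]
    # Drop empty or too-short fragments.
    return [p for p in paragraphs if len(p) >= min_chars]


def build_corpus(documents: list[str]) -> list[str]:
    """Concatenate all extracted documents, then split into paragraphs."""
    return split_paragraphs("\n\n".join(documents))
```

Each returned paragraph would then be stored in the vector database with its own ID. Raising `min_chars` is one way to filter out page numbers and other short extraction debris.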
The results are nearly perfect!
-
@_DigitalWriter_ I'm really looking forward to that! A huge thank-you in advance, Herbert! 😊 🙏
Will #tika and #gotenberg happen to be covered as well, which would apparently allow even #EML to be processed alongside #office files? 😮
Would that finally give me, almost as a side effect, a cross-platform solution for #email archiving (#archivierung)? 😍 https://docs.paperless-ngx.com/configuration#tika
I still find that part a bit complicated, and it would really help me move #paperlessngx up my project agenda! 🥳
-
New blog post, in which I review and test some options for extracting unformatted text from #EPUB files in Python, using #Apache #Tika (via #Tika-python), #Textract and #EbookLib.
Includes link to Git repo with demo scripts.
https://www.bitsgalore.org/2023/03/09/extracting-text-from-epub-files-in-python
-
Yesterday I re-ran a script I made a few weeks ago that extracts text from some EPUB files with #Apache #Tika. To my surprise the output was slightly different from the original run (adding some garbage text to the end of the file), even though I was using the same Tika version!
Turns out this is because by default, Tika OCRs any embedded images if it finds a Tesseract installation. Apparently I installed this as part of some other software after I first ran the script.
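To keep output reproducible regardless of whether Tesseract is installed, OCR can be switched off explicitly. The sketch below assumes a tika-server 2.x instance on the default local port and uses the `X-Tika-OCRskipOcr` request header as I understand it from the Tika server documentation; the function names are hypothetical, and the header should be verified against the Tika version in use.

```python
from urllib.request import Request, urlopen

# Assumption: a tika-server instance listening on its default local port.
TIKA_URL = "http://localhost:9998/tika"


def build_extract_request(data: bytes, skip_ocr: bool = True,
                          url: str = TIKA_URL) -> Request:
    """Build a PUT request asking tika-server for plain text."""
    headers = {"Accept": "text/plain"}
    if skip_ocr:
        # Assumption: tika-server 2.x skips Tesseract when this header is set.
        headers["X-Tika-OCRskipOcr"] = "true"
    return Request(url, data=data, headers=headers, method="PUT")


def extract_text(path: str, skip_ocr: bool = True) -> str:
    """Send a file to tika-server and return the extracted text."""
    with open(path, "rb") as fh:
        request = build_extract_request(fh.read(), skip_ocr)
    with urlopen(request) as response:
        return response.read().decode("utf-8")
```

Recording the Tika version and the OCR setting alongside the extracted text makes later reruns easier to compare.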
-
Java News Roundup: More Log4Shell Statements, Spring and Quarkus Updates, New Value Objects JEP
https://www.infoq.com/news/2021/12/java-news-roundup-dec20-2021/
#java_news roundup dec20 2021 #Development #Architecture_& Design #DevOps #Open_JDK #Quarkus #Java #JDK #Apache_Camel #Tika #JDK_19 #Hibernate_ORM #Project_Loom #JDK_18 #log4j