#apachetika — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #apachetika, aggregated by home.social.
-
I recently added fully recursive extraction of embedded files to Apache Tika's command line.
This will also extract earlier versions of PDFs available through incremental updates.
This feature is still in beta. Let us know what you think.
Details in next toot.
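A minimal sketch of the idea behind recovering earlier versions, assuming the usual incremental-update layout where each saved revision appends its own `%%EOF` marker. This is a naive illustration only, not Tika's implementation; the function name `earlier_versions` is hypothetical, and other PDF layouts (linearized files, for instance) can also contain more than one `%%EOF`.

```python
def earlier_versions(pdf_bytes: bytes) -> list[bytes]:
    """Return candidate earlier revisions of an incrementally updated PDF.

    Assumption: every incremental update ends with its own %%EOF marker,
    so each prefix ending at a %%EOF (except the last) is a candidate
    earlier version. A real tool would also validate each candidate.
    """
    marker = b"%%EOF"
    versions = []
    start = 0
    while True:
        idx = pdf_bytes.find(marker, start)
        if idx == -1:
            break
        end = idx + len(marker)
        versions.append(pdf_bytes[:end])  # prefix up to and including this marker
        start = end
    # The final prefix is the current document, not an earlier version.
    return versions[:-1]
```

Each returned byte string can be written to its own file and opened as a standalone PDF, provided the truncated revision is itself well-formed.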
#fileforensics #districtcon #ipres2025
#helpwanted #digipres #fileformatology #ApacheTika
-
🧠 Open-source & still evolving:
https://github.com/JohannesRabauer/quanta
What would you add next: smarter summaries, multi-agent explorers, or something wild?
#LangChain4j #Quarkus #Ollama #pgvector #ApacheTika #AI #Java #LLaMA3
-
Just submitted an "Intro to #ApacheTika" talk to DistrictCon, Year 1. 🤞
There will be PDFs!🤣
#districtcon @DistrictCon
https://sessionize.com/districtcon/
-
Turning NASA Wake-up Calls into data
by @beet_keeper
For a while back then I was into space flight again. Scientists, science communicators, and engineers were all excited for a new era of rocket launches and the potential unification of the human race as we look towards the future.
During that time I discovered Colin Fries’ work in the NASA History Division to document all NASA “Wake-up calls”. A wake-up call is simply a piece of music used to wake astronauts on missions, a different piece of music, daily, for the duration of the flight.
Take, for example, the last Space Shuttle mission (Space Transportation System) STS-135; it was in flight for 13 days, and the wake-up call on day one was Coldplay’s Viva la Vida, while on day 13 it was Kate Smith singing God Bless America.
As a huge music buff who has the radio or music television on 18 hours a day, I really wanted to delve into this further. While Colin’s work is great, it’s just a PDF file (@wtfpdf). A PDF is not an ideal file format for querying data and gleaning new insights. So, while I wanted to explore it, I first decided to turn it into a true dataset. The result was a set of resources (a website, a JSON file, a CSV, and an SQLite database), each of which is more functional and more maintainable over time.
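As a rough sketch of what publishing the same records as JSON, CSV, and SQLite can look like, the snippet below uses only the Python standard library. The two rows come from the STS-135 example above; the field names, the `wakeup_calls` table name, and the helper functions are assumptions for illustration and are not taken from the actual project.

```python
import csv
import io
import json
import sqlite3

# Two rows from the STS-135 example in the post. Field names are hypothetical.
FIELDS = ["mission", "day", "artist", "song"]
ROWS = [
    {"mission": "STS-135", "day": 1, "artist": "Coldplay", "song": "Viva la Vida"},
    {"mission": "STS-135", "day": 13, "artist": "Kate Smith", "song": "God Bless America"},
]


def to_json(rows):
    """Serialize the rows as a JSON array of objects."""
    return json.dumps(rows, indent=2)


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_sqlite(rows, path=":memory:"):
    """Load the rows into a SQLite table and return the open connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE wakeup_calls "
        "(mission TEXT, day INTEGER, artist TEXT, song TEXT)"
    )
    conn.executemany(
        "INSERT INTO wakeup_calls VALUES (:mission, :day, :artist, :song)", rows
    )
    conn.commit()
    return conn
```

Once the rows are in SQLite, a question such as "what was played on day 13?" becomes a one-line query, which is the kind of access a PDF does not offer.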
Let's take a look at the results and https://nasawakeupcalls.github.io below!
#ApacheTika #Code #Coding #DataWrangling #Datasette #DatasetteLite #DH #DigitalHumanities #glam #harkive #NASA #NASAWakeUpCall #NASAWakeUpCalls #OpenData #PersonalProjects #Science #Space #SpaceHistory #Twitter #WakeUpCall
-
I just added a wrapper for Google's magika detector to Apache Tika.
You can now get detection from `file`, `siegfried` and `magika` (and of course, Tika) in a single parse.
-
Many thanks to the #ApacheSoftwareFoundation for their recent post on #ApacheTika!
Also, of course, many thanks to #DARPA and #ARPA-H for funding #AIxCC and welcoming Tika as a challenge repo.
https://news.apache.org/foundation/entry/asf-project-spotlight-apache-tika
-
And true to form, already opened one issue on #ApacheTika's JIRA: https://issues.apache.org/jira/browse/TIKA-4164 🤣🤣🤣
This is triggered by a file in the #IPRES2023 #bakeoff file set, part of which I'm shamelessly reusing today.
-
I had to cancel my #iPres2023 trip … for reasons. 😭😭😭
In the next couple of months, I’ll try to set up a couple of remote, hands-on tutorials for the #ApacheTika GUI that I built in response to feedback from the #iPres2022 bake-off.
Friends in #digipres, I’m sorry that I can’t make it, and I wish you all the best for what is going to be an amazing iPres!
-
We recently added metadata tables of features extracted by #ApacheTika, and we already had #pdfinfo metadata as well as provenance (including geoip info from #maxmind).
Let us know if this corpus is useful at all as is and/or if we can make it more useful.
-
We recently published the results of running #ApacheTika on the corpus with an emphasis on PDF, erm, features.
There are two tables: a) each row is a URL (for the primary/container PDF) and b) each row is a URL for the primary/container PDF OR an attachment within that PDF.
https://downloads.digitalcorpora.org/corpora/files/CC-MAIN-2021-31-PDF-UNTRUNCATED/metadata/
-
3) I reran Tika, 'file' and #siegfried on all the files.
You can explore the mimes via datasette: https://corpora.tika.apache.org/datasette
Or, download the whole sqlite db: https://corpora.tika.apache.org/base/share/tika-mimes-20230714.db.gz
I mean, who wouldn't want to spend the weekend looking for differences btwn #siegfried and #file and #ApacheTika?!
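For anyone who does want to spend the weekend that way, here is a hedged sketch of a query that counts files where the three detectors disagree. The schema is an assumption: the table name `mimes`, its columns, and the demo rows are made up for illustration, and the real database may be laid out quite differently.

```python
import sqlite3


def disagreements(conn):
    """Group files by detector output wherever at least two detectors differ.

    IS NOT is SQLite's NULL-safe inequality, so a missing result from one
    detector counts as a disagreement rather than being silently dropped.
    """
    query = """
        SELECT tika_mime, file_mime, siegfried_mime, COUNT(*) AS n
        FROM mimes
        WHERE tika_mime IS NOT file_mime
           OR tika_mime IS NOT siegfried_mime
           OR file_mime IS NOT siegfried_mime
        GROUP BY tika_mime, file_mime, siegfried_mime
        ORDER BY n DESC, tika_mime
    """
    return conn.execute(query).fetchall()


def demo_connection():
    """Build a tiny in-memory database using the hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mimes "
        "(path TEXT, tika_mime TEXT, file_mime TEXT, siegfried_mime TEXT)"
    )
    conn.executemany(
        "INSERT INTO mimes VALUES (?, ?, ?, ?)",
        [
            # Made-up rows, not taken from the real corpus.
            ("a.pdf", "application/pdf", "application/pdf", "application/pdf"),
            ("b.bin", "application/octet-stream", "text/plain", None),
            ("c.bin", "application/octet-stream", "text/plain", None),
            ("d.xml", "application/xml", "text/xml", "application/xml"),
        ],
    )
    return conn
```

To try it against the downloaded file, decompress it and pass `sqlite3.connect(...)` on that path to `disagreements`, after adjusting the table and column names to whatever the real schema uses.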