#apachetika — Public Fediverse posts on home.social

Ross Spencer @[email protected] · 2026-05-19 · 22:15 UTC

Porting SafeText and analyzing digital content with Apache Tika

Last year I wrote about pitfalls in modern journalism, especially with regard to receiving documents and information from whistleblowers without offering them adequate protection.

The tl;dr is that you, as a whistleblower, need to protect yourself; and you, as an editor or journalist, need to protect your whistleblowers.

Steganographic fingerprints might be one method adopted to detect someone leaking information. Steganographic characters replace common textual characters with unusual but hard-to-detect variants, i.e. characters that look the same to the human eye or are actually invisible. Using a tool called SafeText by David Jacobson, we can identify these hidden fingerprints in the content that you share.
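
As a rough illustration of the idea, and not SafeText's actual implementation, the Go sketch below flags two kinds of suspicious code point: invisible format characters (Unicode category Cf, such as the zero-width space) and a handful of look-alike characters. The `Finding` type, the `Scan` function, and the tiny `homoglyphs` table are all hypothetical names chosen for this example; a real tool would need a much larger table.

```go
package main

import (
	"fmt"
	"unicode"
)

// Finding records one suspicious rune and where it was seen.
// (Hypothetical type for this sketch, not part of SafeText.)
type Finding struct {
	Offset int    // byte offset of the rune in the input
	Rune   rune   // the suspicious code point
	Reason string // why it was flagged
}

// homoglyphs is a deliberately tiny, illustrative table mapping a
// look-alike rune to the plain character it resembles.
var homoglyphs = map[rune]rune{
	'\u0430': 'a', // Cyrillic small letter a
	'\u0435': 'e', // Cyrillic small letter ie
	'\u043e': 'o', // Cyrillic small letter o
	'\u00a0': ' ', // no-break space
	'\u2009': ' ', // thin space
}

// Scan reports format characters (Unicode category Cf, e.g. zero-width
// space) and any rune listed in the homoglyphs table.
func Scan(text string) []Finding {
	var out []Finding
	for i, r := range text {
		if unicode.Is(unicode.Cf, r) {
			out = append(out, Finding{i, r, "invisible format character"})
		} else if plain, ok := homoglyphs[r]; ok {
			out = append(out, Finding{i, r, fmt.Sprintf("look-alike for %q", plain)})
		}
	}
	return out
}

func main() {
	// The sample hides a zero-width space and a Cyrillic "e".
	for _, f := range Scan("le\u200bak the m\u0435mo") {
		fmt.Printf("offset %d: U+%04X (%s)\n", f.Offset, f.Rune, f.Reason)
	}
}
```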

I firmly believe we can find clues about what is important to preserve, or learn to preserve, when we analyse the content of the digital record and not just the (file) format of the digital record.

A file can contain many different features, and each one is a challenge to the file's future interpretation, and thus its preservation.

I wanted to use SafeText in some of my other non-Python tooling and so I decided to port the code to Golang as a composable module and binary.

By coincidence at the time I started writing this I had also just written about revisiting tikalinkextract and so I thought I would write this small explanation about how you might combine Tika and SafeText to perform some content analysis of your own.

Who knows, maybe we will find a conspiracy. Maybe we’ll find secret codes in our own digital records. Maybe we’ll learn something new about our records…

Let's have a look at putting Tika and SafeText together and see where it goes.
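
One way the combination might look, sketched in Go under some assumptions: a tika-server instance is listening at `http://localhost:9998` (its usual default), text is requested by sending the file with `PUT` to the `/tika` endpoint with `Accept: text/plain`, and `CountInvisible` is only a stand-in for a SafeText-style check. The function names are hypothetical and this is not the code from the SafeText port or from tikalinkextract.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"os"
	"unicode"
)

// NewTextRequest builds the PUT request that asks tika-server for the
// plain-text rendering of a file. baseURL is an assumption supplied by
// the caller, e.g. http://localhost:9998.
func NewTextRequest(baseURL string, data []byte) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPut, baseURL+"/tika", bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "text/plain")
	return req, nil
}

// ExtractText sends the request and returns the extracted text.
func ExtractText(client *http.Client, baseURL string, data []byte) (string, error) {
	req, err := NewTextRequest(baseURL, data)
	if err != nil {
		return "", err
	}
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("tika-server returned %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

// CountInvisible is a stand-in for a SafeText-style check: it counts
// Unicode format characters (category Cf) such as zero-width spaces.
func CountInvisible(text string) int {
	n := 0
	for _, r := range text {
		if unicode.Is(unicode.Cf, r) {
			n++
		}
	}
	return n
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: scan <file>")
		return
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	text, err := ExtractText(http.DefaultClient, "http://localhost:9998", data)
	if err != nil {
		fmt.Println("extraction error:", err)
		return
	}
	fmt.Printf("%d invisible format characters found\n", CountInvisible(text))
}
```

Letting Tika handle extraction keeps the character check independent of file format: the same scan runs over text pulled from a PDF, a word-processing document, or an email.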

#ApacheTika #authenticity #Code #Coding #ContentAnalysis #Data #DigitalHumanities #digitalLiteracy #DigitalPreservation #Golang #integrity #Metadata #Paradata #SafeText #steganography

#apachetika #authenticity #code #coding #contentanalysis #data

Tim Allison @[email protected] · 2026-05-05 · 11:57 UTC

Voting is underway for #ApacheTika 4.0.0-alpha-1! 🎉

Started work on the 4.x branch in October 2024. Lots has changed, core principles remain.

Many, many thanks to the community of fellow devs and users!

Onwards towards 4.0.0!

https://lists.apache.org/thread/bjowzh4ssgtrghqjk7g2dtn9hs3qmyrv

#apachetika

Tim Allison @[email protected] · 2026-04-09 · 17:18 UTC

Preview revamp of our website for #ApacheTika 4.x is live: https://tika.apache.org/docs/4.0.0-SNAPSHOT/

Let us know what you think and/or open PRs! Please!

#apachetika

Tim Allison @[email protected] · 2026-03-18 · 21:47 UTC

Voting is underway for #ApacheTika 3.3.0! Please give it a try and let us know if there are any surprises!

https://lists.apache.org/thread/pq4zjvqf3w5zbm5yoyg14qvr2kpd2by3

#apachetika

Tim Allison @[email protected] · 2025-12-10 · 22:23 UTC

On #ApacheTika we're moving entirely to json for configuration in 4.x.

If you use tika-server and are interested in runtime configuration, please take a look and offer feedback:

https://lists.apache.org/thread/jlt8jv47t8tm58dlrnxsrfodxm2d6o0z

Please repost for reach.

#apachetika

Pyrzout :vm: @[email protected] · 2025-12-09 · 07:45 UTC

Apache Tika Vulnerability Widens Across Multiple Modules, Severity Now 10.0 https://thecyberexpress.com/apache-tika-critical-cve/ #TheCyberExpressNews #Vulnerabilities #TheCyberExpress #FirewallDaily #CVE202554988 #ApacheTika #CyberNews #CVE #XML

#thecyberexpressnews #vulnerabilities #thecyberexpress #firewalldaily #cve202554988 #apachetika

OffSequence @[email protected] · 2025-12-06 · 05:34 UTC

⚠️ CRITICAL XXE bug (CVE-2025-66516, CVSS 10.0) in Apache Tika (tika-core, tika-pdf-module, tika-parsers). Exploitation via crafted PDFs can lead to file disclosure & RCE. Upgrade to 3.2.2+ ASAP! https://radar.offseq.com/threat/critical-xxe-bug-cve-2025-66516-cvss-100-hits-apac-d08561e7 #OffSeq #ApacheTika #XXE #Security

#offseq #apachetika #xxe #security

OffSequence @[email protected] · 2025-12-05 · 01:04 UTC

🚨 CVE-2025-66516 CRITICAL: XXE in Apache Tika core (v1.13–3.2.1), tika-pdf-module, tika-parsers. Exploitable via crafted PDF XFA files — risks data exfil & DoS. Patch to 3.2.2+ now! https://radar.offseq.com/threat/cve-2025-66516-cwe-611-improper-restriction-of-xml-fa601313 #OffSeq #ApacheTika #XXE #Vuln

#offseq #apachetika #xxe #vuln
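
To check a dependency version against the range quoted in the post above (tika-core 1.13 through 3.2.1, fixed in 3.2.2), a small comparison along these lines may be enough. It is a sketch only: `Affected` is a hypothetical helper, it ignores pre-release suffixes such as `-alpha-1`, and it encodes just the range stated in the post, so the official advisory remains the authority on which artifacts and versions are affected.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parse turns "3.2.1" into [3 2 1]. Missing parts count as zero and any
// suffix after a hyphen (e.g. "-alpha-1") is ignored in this rough check.
func parse(v string) ([3]int, error) {
	var out [3]int
	v, _, _ = strings.Cut(v, "-")
	parts := strings.Split(v, ".")
	if len(parts) > 3 {
		return out, fmt.Errorf("bad version %q: too many parts", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("bad version %q: %w", v, err)
		}
		out[i] = n
	}
	return out, nil
}

// compare returns -1, 0 or 1 when a is lower than, equal to, or higher than b.
func compare(a, b [3]int) int {
	for i := range a {
		if a[i] < b[i] {
			return -1
		}
		if a[i] > b[i] {
			return 1
		}
	}
	return 0
}

// Affected reports whether v lies in the range quoted in the post:
// 1.13 through 3.2.1 inclusive.
func Affected(v string) (bool, error) {
	got, err := parse(v)
	if err != nil {
		return false, err
	}
	lo, hi := [3]int{1, 13, 0}, [3]int{3, 2, 1}
	return compare(got, lo) >= 0 && compare(got, hi) <= 0, nil
}

func main() {
	for _, v := range []string{"1.12", "2.9.1", "3.2.1", "3.2.2"} {
		hit, err := Affected(v)
		fmt.Println(v, hit, err)
	}
}
```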

Tim Allison @[email protected] · 2025-11-12 · 17:13 UTC

RE: https://mastodon.social/@tallison/115452030199746498

Please join me tomorrow, November 13 at noon EST to chat #ApacheTika.

Please dm me for the connection info.

#apachetika

Tim Allison @[email protected] · 2025-10-30 · 13:51 UTC

LOL.. given that I'm going to be a remote presenter, I taped my Digital Preservation Bake-off talk last night in case I have wifi-problems during the session.

I really wish conferences would require 3 or 4 videos of the talk before I'm allowed to speak.

#ipres2025 #digipresBakeoff #ApacheTika

#ipres2025 #digipresbakeoff #apachetika

Tim Allison @[email protected] · 2025-10-28 · 13:29 UTC

In belated celebration of World Digital Preservation Day, I'm throwing a "What's new with Apache Tika/Office hours" meetup at noon on November 13 EST.

This is intended for anyone interested in files from search to digital preservation to file forensics/reverse engineering folks.

https://www.meetup.com/apache-tika-community/events/311746184

#wdpd2025 #ApacheTika

#wdpd2025 #apachetika

Tim Allison @[email protected] · 2025-10-27 · 14:28 UTC

If I hosted an #ApacheTika demo/office hours on Thursday, Nov 6 at noon EST, would that time work?

#wdpd2025

#apachetika #wdpd2025

Tim Allison @[email protected] · 2025-10-27 · 14:17 UTC

@mutanthumb

Maybe I should throw a demo/office hours for #ApacheTika on #wdpd2025?

#apachetika #wdpd2025

Tim Allison @[email protected] · 2025-10-24 · 21:00 UTC

@mutanthumb

Y, #ApacheTika will extract what the PDF alleges it is.

These are some of the fields that I'll focus on in the #digipresBakeoff #ipres2025 #ipresBakeOff

These include pdf/a and pdf/x. hasMarkedContent suggests PDF/UA.

#apachetika #digipresbakeoff #ipres2025 #ipresbakeoff
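
A sketch of how such self-declared fields might be summarised once the metadata is in hand. The key names `pdfaid:part`, `pdfaid:conformance` and `pdf:hasMarkedContent` are assumptions about how Tika labels these fields and should be checked against real output; PDF/X is left out because its key name is not given in the post. `Claims` is a hypothetical helper, and nothing here validates the file: it only reports what the PDF alleges about itself.

```go
package main

import (
	"fmt"
	"strings"
)

// Assumed metadata key names; verify them against actual Tika output.
const (
	keyPDFAPart        = "pdfaid:part"
	keyPDFAConformance = "pdfaid:conformance"
	keyMarkedContent   = "pdf:hasMarkedContent"
)

// Claims summarises what a PDF's metadata alleges about itself.
// It performs no validation of the file.
func Claims(meta map[string]string) []string {
	var out []string
	if part := meta[keyPDFAPart]; part != "" {
		level := strings.ToLower(meta[keyPDFAConformance])
		out = append(out, "alleges PDF/A-"+part+level)
	}
	if meta[keyMarkedContent] == "true" {
		out = append(out, "has marked content (may suggest PDF/UA)")
	}
	return out
}

func main() {
	// Hand-written sample values, not output from a real file.
	meta := map[string]string{
		keyPDFAPart:        "2",
		keyPDFAConformance: "B",
		keyMarkedContent:   "true",
	}
	for _, c := range Claims(meta) {
		fmt.Println(c)
	}
}
```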

Tim Allison @[email protected] · 2025-10-15 · 10:53 UTC

I recently added fully recursive extraction of embedded files to Apache Tika's commandline.

This will also extract earlier versions of PDFs available through incremental updates.

This feature is still in beta. Let us know what you think.

Details in next toot.

#fileforensics #districtcon #ipres2025
#helpwanted #digipres #fileformatology #ApacheTika

#fileforensics #districtcon #ipres2025 #helpwanted #digipres #fileformatology
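
Earlier versions are recoverable because a PDF incremental update appends new objects, a new cross-reference section and a fresh `%%EOF` marker to the end of the existing file, leaving the earlier bytes in place. The Go sketch below is a heuristic illustration of that idea and not Tika's implementation: the hypothetical `Revisions` function simply cuts the file after each `%%EOF`. The marker can appear for other reasons (linearized files, or by chance inside stream data), so each candidate would still need to be parsed to confirm it is a valid earlier version.

```go
package main

import (
	"bytes"
	"fmt"
)

var eofMarker = []byte("%%EOF")

// Revisions returns one candidate per %%EOF marker found in data. Each
// candidate runs from the start of the file to the end of that marker,
// which is where an earlier saved version would have ended.
func Revisions(data []byte) [][]byte {
	var out [][]byte
	pos := 0
	for {
		i := bytes.Index(data[pos:], eofMarker)
		if i < 0 {
			return out
		}
		end := pos + i + len(eofMarker)
		out = append(out, data[:end])
		pos = end
	}
}

func main() {
	// A toy stand-in for a PDF that has been updated once.
	sample := []byte("%PDF-1.7\noriginal body\n%%EOF\nappended update\n%%EOF\n")
	for n, rev := range Revisions(sample) {
		fmt.Printf("candidate %d: %d bytes\n", n+1, len(rev))
	}
}
```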

Tim Allison @[email protected] · 2025-10-13 · 17:21 UTC

@Thorsted @mickylindlar

Thank you for sharing! I confirmed that #ApacheTika is correctly throwing an EncryptedDocumentException on that file. 🎉

#apachetika

Tim Allison @[email protected] · 2025-10-13 · 17:06 UTC

@Thorsted @mickylindlar

If different enough from our IRM unit test, would you mind sharing with me, too? :D

I want to make sure that we're correctly flagging these in #ApacheTika
https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/resources/test-documents/testMicrosoftIRMServices.pdf

#apachetika

Johannes Rabauer @[email protected] · 2025-10-07 · 13:30 UTC

🧠 Open-source & still evolving:
https://github.com/JohannesRabauer/quanta

What would you add next, smarter summaries, multi-agent explorers, or something wild?
#LangChain4j #Quarkus #Ollama #pgvector #ApacheTika #AI #Java #LLaMA3

#langchain4j #quarkus #ollama #pgvector #apachetika #ai

Tim Allison @[email protected] · 2025-09-26 · 19:45 UTC

Took the day "off" and made some progress on this for #ApacheTika

Files changed: 220 🤣

I think this is the last major change for Tika 4.x.

#apachetika

Tim Allison @[email protected] · 2025-09-26 · 16:55 UTC

Moving tika-pipes out of tika-core in #ApacheTika 4.x.

Please take a look and let me know what you think:

https://github.com/apache/tika/pull/2339

Ref: https://issues.apache.org/jira/browse/TIKA-4334

#apachetika

Tim Allison @[email protected] · 2025-09-16 · 17:40 UTC

I just noticed there are 1.3 million pulls of tika-server on Docker Hub per month.

That's A LOT of files parsed!

Happy parsing!

#ApacheTika #docker

#apachetika #docker

Tim Allison @[email protected] · 2025-09-16 · 16:51 UTC

Just submitted an "Intro to #ApacheTika" talk to DistrictCon, Year 1. 🤞

There will be PDFs!🤣

#districtcon @DistrictCon
https://sessionize.com/districtcon/

#apachetika #districtcon

Tim Allison @[email protected] · 2025-09-11 · 15:32 UTC

#ApacheTika 3.2.3 release candidate #1 is up for vote!

This is a bugfix release that fixes a bug in processing XFA within PDFs via tika-server.

https://lists.apache.org/thread/px1stbwnbgx301y4sg6yxycrmcqt27gf

#apachetika

Tim Allison @[email protected] · 2025-09-06 · 11:41 UTC

I just learned about @DistrictCon 's CFP, deadline is Sep 28.

Anyone interested in #ApacheTika for file deep dives?

No matter the answer, please consider submitting your talks!

http://sessionize.com/districtcon

#apachetika

Tim Allison @[email protected] · 2025-09-05 · 16:12 UTC

This one is particularly rewarding for me w.r.t. #ApacheTika.

This shows the CRS trying to trigger a zip slip in our existing unpacker code. It couldn't, so it eventually found the vulnerable harness (and new class) that I added for the competition for this "entry level zip slip" challenge.

https://theori-io.github.io/aixcc-public/logs/#/view/tika_povs.1.log/140367160048976_1750810832.2368217

#apachetika
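
For readers unfamiliar with the term: a zip slip happens when an archive entry name such as `../../etc/passwd` is joined onto the output directory without checking where the result lands. The usual guard, sketched here in Go and not taken from Tika's unpacker, is to resolve the target path and refuse anything that ends up outside the destination. `SafeJoin` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// SafeJoin resolves an archive entry name against destDir and rejects
// any result that would land outside destDir.
func SafeJoin(destDir, entryName string) (string, error) {
	base := filepath.Clean(destDir)
	target := filepath.Join(base, entryName) // Join also cleans ".." segments
	rel, err := filepath.Rel(base, target)
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("entry %q escapes %q", entryName, destDir)
	}
	return target, nil
}

func main() {
	for _, name := range []string{"docs/a.txt", "../../etc/passwd"} {
		path, err := SafeJoin("/tmp/out", name)
		fmt.Println(name, "->", path, err)
	}
}
```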

Tim Allison @[email protected] · 2025-09-04 · 11:50 UTC

I'll be back in the kitchen at the #ipres2025 Digital Preservation Bake Off with #ApacheTika (virtual).

Any requests?

https://www.ipres2025.nz/post/ipres-tools-demo-session-the-digital-preservation-bake-off

#digipres #digipresBakeOff

#ipres2025 #apachetika #digipres #digipresbakeoff

Tim Allison @[email protected] · 2025-08-20 · 20:06 UTC

Time to upgrade #ApacheTika to 3.2.2.

XXE in XFA parsing up through version 3.2.1

https://lists.apache.org/thread/8xn3rqy6kz5b3l1t83kcofkw0w4mmj1w

#apachetika

Tim Allison @[email protected] · 2025-08-11 · 13:36 UTC

@mhoye We've gotten similar on #ApacheTika , an open source file type detection and parser library.

What even?

#apachetika

Tim Allison @[email protected] · 2025-08-06 · 18:07 UTC

#ApacheTika 3.2.2 release candidate #1 is up for vote!

https://lists.apache.org/thread/60zn1wyx4skm9b63f4x9p91hrr9lyh08

#apachetika

Ross Spencer @[email protected] · 2025-07-29 · 07:29 UTC

Turning NASA Wake-up Calls into data

by @beet_keeper

For a while back then I was into space flight again. Scientists, science communicators, and engineers were all excited for a new era of rocket launches and the potential unification of the human race as we look towards the future.

During that time I discovered Colin Fries’ work in the NASA History Division to document all NASA “Wake-up calls”. A wake-up call is simply a piece of music used to wake astronauts on missions, a different piece of music, daily, for the duration of the flight.

Take, for example, the last Space Shuttle mission (Space Transportation System) STS-135; it was in flight for 13 days, and the wake-up call on day one was Coldplay’s Viva la Vida, while on day 13 it was Kate Smith singing God Bless America.

As a huge music buff who has the radio or music television on 18 hours a day, I really wanted to delve into this further. While Colin’s work is great, it’s just a PDF file (@wtfpdf). A PDF is not an ideal file format for querying data and gleaning new insights. So, while I wanted to explore it, I first decided to turn it into a true dataset. The result was a set of resources, a website, a JSON, a CSV, and an SQLite database which are each more functional and more maintainable over time.

Let's take a look at the results and https://nasawakeupcalls.github.io below!
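
To make the "true dataset" idea concrete, here is a minimal Go sketch that writes the two STS-135 wake-up calls mentioned above as both CSV and JSON. The `WakeUpCall` struct and its field names are hypothetical and are not the schema used by the published dataset; they only show why structured records are easier to query than a PDF.

```go
package main

import (
	"encoding/csv"
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// WakeUpCall is a hypothetical record shape for this sketch.
type WakeUpCall struct {
	Mission string `json:"mission"`
	Day     int    `json:"day"`
	Artist  string `json:"artist"`
	Song    string `json:"song"`
}

// ToCSV renders the records with a header row.
func ToCSV(calls []WakeUpCall) (string, error) {
	var sb strings.Builder
	w := csv.NewWriter(&sb)
	if err := w.Write([]string{"mission", "day", "artist", "song"}); err != nil {
		return "", err
	}
	for _, c := range calls {
		row := []string{c.Mission, strconv.Itoa(c.Day), c.Artist, c.Song}
		if err := w.Write(row); err != nil {
			return "", err
		}
	}
	w.Flush()
	return sb.String(), w.Error()
}

// ToJSON renders the same records as an indented JSON array.
func ToJSON(calls []WakeUpCall) (string, error) {
	b, err := json.MarshalIndent(calls, "", "  ")
	return string(b), err
}

func main() {
	calls := []WakeUpCall{
		{"STS-135", 1, "Coldplay", "Viva la Vida"},
		{"STS-135", 13, "Kate Smith", "God Bless America"},
	}
	csvOut, _ := ToCSV(calls)
	jsonOut, _ := ToJSON(calls)
	fmt.Print(csvOut)
	fmt.Println(jsonOut)
}
```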

#ApacheTika #Code #Coding #DataWrangling #Datasette #DatasetteLite #DH #DigitalHumanities #glam #harkive #NASA #NASAWakeUpCall #NASAWakeUpCalls #OpenData #PersonalProjects #Science #Space #SpaceHistory #Twitter #WakeUpCall

#apachetika #code #coding #datawrangling #datasette #datasettelite

Tim Allison @[email protected] · 2025-07-10 · 13:49 UTC

#ApacheTika 3.2.1 is now available!

https://lists.apache.org/thread/slhbnqm9gks5vtt95g1oxc9hd33f302m

#apachetika

Tim Allison @[email protected] · 2025-07-09 · 20:37 UTC

Starting to work with #sleuthKit .

I was able to solve @xchatty's m57-jean problem in ~10 min after I built sleuthkit and ran #ApacheTika against two files that I guessed would be interesting. :D

https://digitalcorpora.org/corpora/scenarios/m57-jean/

I 😍 these open resources!

#sleuthkit #apachetika
