#fileforensics — Public Fediverse posts on home.social

Tim Allison @[email protected] · 2025-10-15 · 10:53 UTC

I recently added fully recursive extraction of embedded files to Apache Tika's command line.

This will also extract earlier versions of PDFs available through incremental updates.

This feature is still in beta. Let us know what you think.

Details in next toot.

#fileforensics #districtcon #ipres2025
#helpwanted #digipres #fileformatology #ApacheTika
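
For readers new to incremental updates: a PDF saved this way appends each new revision to the end of the file, and each revision normally ends with its own `%%EOF` marker. The sketch below is a rough heuristic under that assumption, not Tika's implementation. The function names are made up for illustration, and linearized PDFs or stray `%%EOF` bytes inside streams can produce false positives.

```python
def revision_end_offsets(pdf_bytes):
    """Return the byte offset just past each %%EOF marker found in the data."""
    marker = b"%%EOF"
    offsets = []
    pos = pdf_bytes.find(marker)
    while pos != -1:
        offsets.append(pos + len(marker))
        pos = pdf_bytes.find(marker, pos + len(marker))
    return offsets


def candidate_revisions(pdf_bytes):
    """Slice the file at each %%EOF; each prefix is a candidate earlier version."""
    return [pdf_bytes[:end] for end in revision_end_offsets(pdf_bytes)]


if __name__ == "__main__":
    # Synthetic stand-in for a PDF with one incremental update (not a valid PDF).
    sample = (
        b"%PDF-1.4\n1 0 obj\n<<>>\nendobj\ntrailer\n<<>>\n%%EOF\n"
        b"2 0 obj\n<<>>\nendobj\ntrailer\n<<>>\n%%EOF\n"
    )
    for i, rev in enumerate(candidate_revisions(sample), start=1):
        print(f"candidate revision {i}: {len(rev)} bytes")
```

Each candidate still has to be opened with a real PDF parser to confirm it is a usable earlier version.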

Tim Allison @[email protected] · 2024-11-05 · 14:45 UTC

@Ange "filtering files quickly and possibly reliably" 🤣🤣🤣

Thank you for sharing. This is a fantastic talk!

#fileformatology #fileFormatGeekery #fileforensics

Tim Allison @[email protected] · 2024-01-17 · 13:04 UTC

Anyone in #fileforensics #forensics #digitalforensics willing to offer an informational interview?

I'm trying to figure out if that field would be a good fit for me.

I have a substantial track record in open source communities and decent knowledge of file formats and some of the mayhem available. 😄

#fedihire

Tim Allison @[email protected] · 2023-11-09 · 16:59 UTC

@decalage recently asked me if we had any files with Adobe LiveCycle's Usage Rights. I hadn't come across these before, but I think they'd have important implications for #fileforensics and #digipres

Adobe's link: https://help.adobe.com/en_US/livecycle/11.0/Services/WS92d06802c76abadb-6ec569c512dbeb3d9d6-7ffd.2.html

I opened https://issues.apache.org/jira/browse/TIKA-4168 to track discussion of this.

If you care about this topic or can offer technical advice on how to extract this info, please help!

cc @PDFassociation

Tim Allison @[email protected] · 2023-07-20 · 14:00 UTC

I've gotten a bunch of #infosec followers over the last coupla days.

For those interested in #fileforensics and especially PDFs, please take a look at our fairly newly released 8 million/8TB PDF corpus, derived from #CommonCrawl and then augmented by our team at #nasajpl

https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated/

Tim Allison @[email protected] · 2023-07-18 · 20:49 UTC

The info in the tables includes file types for embedded files, depth of embedded files, language id and a bunch of other features, including the #outOfVocabulary statistic.

These kinds of stats are really important for ingest for #search, #digipres and #fileforensics.

Take a look at the *-1k.csv files, and I can share the config file that extracted that info if you have an interest.
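
As a rough idea of how one might summarize a table like this, the sketch below tallies the values in one column of a CSV. The column names (`url`, `embedded_depth`, `embedded_mime`) and the sample rows are invented for illustration; the real headers in the *-1k.csv files may well differ, so check them first.

```python
import csv
import io
from collections import Counter


def tally_column(csv_text, column):
    """Count how often each non-empty value appears in the named column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader if row.get(column))


# Hypothetical rows; these are NOT taken from the published tables.
SAMPLE = """url,embedded_depth,embedded_mime
https://example.com/a.pdf,0,application/pdf
https://example.com/a.pdf,1,image/jpeg
https://example.com/b.pdf,0,application/pdf
"""

if __name__ == "__main__":
    for mime, count in tally_column(SAMPLE, "embedded_mime").most_common():
        print(mime, count)
```

For a real file, read its text with `open(path, newline="", encoding="utf-8")` and pass the contents in, or adapt the function to take a file object.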

Tim Allison @[email protected] · 2023-07-18 · 20:45 UTC

We recently published the results of running #ApacheTika on the corpus with an emphasis on PDF, erm, features.

There are two tables: a) each row is a URL (for the primary/container PDF) and b) each row is a URL for the primary/container PDF OR an attachment within that PDF.

#digipres #fileforensics

https://downloads.digitalcorpora.org/corpora/files/CC-MAIN-2021-31-PDF-UNTRUNCATED/metadata/
