#fileforensics — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #fileforensics, aggregated by home.social.
-
I recently added fully recursive extraction of embedded files to Apache Tika's command-line tool.
It can also extract earlier versions of a PDF that are preserved through incremental updates.
This feature is still in beta. Let us know what you think.
Details in next toot.
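For readers unfamiliar with incremental updates: a PDF writer can append changes to the end of a file rather than rewriting it, and each appended revision normally ends with its own %%EOF marker. The sketch below is a rough heuristic built on that fact, not how Apache Tika does it. The function names are made up for illustration. Expect false positives, since linearized PDFs usually carry an extra %%EOF and the marker can also appear inside stream data.

```python
import re

# Matches each end-of-file marker plus an optional trailing line ending.
EOF_MARKER = re.compile(rb"%%EOF(?:\r\n|\r|\n)?")


def revision_end_offsets(pdf_bytes: bytes) -> list[int]:
    """Return the byte offset just past each %%EOF marker, in file order."""
    return [match.end() for match in EOF_MARKER.finditer(pdf_bytes)]


def carve_revisions(pdf_bytes: bytes) -> list[bytes]:
    """Return one candidate byte string per revision, oldest first.

    Each candidate is the file truncated just after one %%EOF marker,
    i.e. roughly what the file looked like before later updates were
    appended. Candidates still need to be validated with a real parser.
    """
    return [pdf_bytes[:end] for end in revision_end_offsets(pdf_bytes)]
```

A file with a single %%EOF yields one candidate, which is the file itself up to that marker. More than one candidate is a hint, not proof, that earlier versions may be recoverable.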
#fileforensics #districtcon #ipres2025
#helpwanted #digipres #fileformatology #ApacheTika
-
@Ange "filtering files quickly and possibly reliably" 🤣🤣🤣
Thank you for sharing. This is a fantastic talk!
-
Anyone in #fileforensics #forensics #digitalforensics willing to offer an informational interview?
I'm trying to figure out if the field would be a good fit for me.
I have a substantial track record in open source communities and decent knowledge of file formats and some of the mayhem available. 😄
-
@decalage recently asked me if we had any files with Adobe LiveCycle's Usage Rights. I hadn't come across these before, but I think they'd have important implications for #fileforensics and #digipres
Adobe's link: https://help.adobe.com/en_US/livecycle/11.0/Services/WS92d06802c76abadb-6ec569c512dbeb3d9d6-7ffd.2.html
I opened https://issues.apache.org/jira/browse/TIKA-4168 to track discussion of this.
If you care about this topic or can offer technical advice on how to extract this info, please help!
-
I've gotten a bunch of #infosec followers over the last coupla days.
For those interested in #fileforensics and especially PDFs, please take a look at our fairly recently released PDF corpus (8 million files, 8 TB), derived from #CommonCrawl and then augmented by our team at #nasajpl
https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated/
-
The info in the tables includes file types for embedded files, depth of embedded files, language id and a bunch of other features, including the #outOfVocabulary statistic.
These kinds of stats are really important for ingest for #search, #digipres and #fileforensics.
Take a look at the *-1k.csv files. If you're interested, I can share the config file used to extract that info.
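As a starting point for exploring one of those tables, here is a minimal sketch that tallies file types and finds the deepest embedding level. The column names ("mime", "embedded_depth") and the sample rows are assumptions made up for illustration. Check the real CSV headers and substitute them.

```python
import csv
import io
from collections import Counter

# Hypothetical column names; replace with the actual headers in the CSV.
TYPE_COLUMN = "mime"
DEPTH_COLUMN = "embedded_depth"


def summarize(csv_file, type_column=TYPE_COLUMN, depth_column=DEPTH_COLUMN):
    """Count rows per file type and report the deepest embedding level."""
    type_counts = Counter()
    max_depth = 0
    for row in csv.DictReader(csv_file):
        type_counts[row[type_column]] += 1
        depth_text = (row.get(depth_column) or "").strip()
        # Skip blank or non-numeric depth cells instead of failing.
        if depth_text.isdigit():
            max_depth = max(max_depth, int(depth_text))
    return type_counts, max_depth


# Made-up rows standing in for a slice of the attachment-level table.
sample = io.StringIO(
    "url,mime,embedded_depth\n"
    "http://example.com/a.pdf,application/pdf,0\n"
    "http://example.com/a.pdf,image/jpeg,1\n"
    "http://example.com/b.pdf,application/pdf,0\n"
)
counts, deepest = summarize(sample)
print(counts.most_common())
print(deepest)
```

To run it on real data, pass a file handle opened with newline="" in place of the in-memory sample.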
-
We recently published the results of running #ApacheTika on the corpus with an emphasis on PDF, erm, features.
There are two tables: a) one row per URL (the primary/container PDF) and b) one row per file, where a file is either the primary/container PDF at a URL or an attachment within that PDF.
https://downloads.digitalcorpora.org/corpora/files/CC-MAIN-2021-31-PDF-UNTRUNCATED/metadata/