home.social

#fileforensics — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #fileforensics, aggregated by home.social.

  1. I recently added fully recursive extraction of embedded files to Apache Tika's command-line interface.

    It also extracts earlier versions of a PDF that are still recoverable from the file's incremental updates.

    This feature is still in beta. Let us know what you think.

    Details in next toot.

    #fileforensics #districtcon #ipres2025
    #helpwanted #digipres #fileformatology #ApacheTika
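
For readers unfamiliar with the incremental updates mentioned in post 1: a PDF saved incrementally has each revision appended to the end of the file, and each revision normally ends with its own `%%EOF` marker. The sketch below illustrates that general idea in Python; it is not Tika's implementation, and the function name `split_revisions` is hypothetical. It assumes every `%%EOF` closes a revision, which is not always true: linearized PDFs carry an extra marker, and the byte sequence can also occur inside stream data.

```python
import re

# Matches a %%EOF marker plus optional trailing blanks and one end-of-line.
EOF_MARKER = re.compile(rb"%%EOF[ \t]*(?:\r\n|\r|\n)?")


def split_revisions(pdf_bytes: bytes) -> list[bytes]:
    """Return one candidate revision per %%EOF marker.

    Each candidate is the original file truncated just after a marker,
    so the first item is the oldest revision and the last is the newest.
    Assumption: every marker ends a revision (see caveats in the lead-in).
    """
    return [pdf_bytes[: m.end()] for m in EOF_MARKER.finditer(pdf_bytes)]


if __name__ == "__main__":
    # Toy bytes that only mimic the layout of an incrementally saved PDF;
    # they are not a valid PDF and are not taken from any real file.
    base = b"%PDF-1.7\n1 0 obj\n<<>>\nendobj\nstartxref\n0\n%%EOF\n"
    update = b"2 0 obj\n<<>>\nendobj\nstartxref\n0\n%%EOF\n"
    for i, rev in enumerate(split_revisions(base + update)):
        print(f"candidate revision {i}: {len(rev)} bytes")
```

Each candidate would still need to be opened with a real PDF parser to confirm it is a usable earlier version.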

  2. @Ange "filtering files quickly and possibly reliably" 🤣🤣🤣

    Thank you for sharing. This is a fantastic talk!

    #fileformatology #fileFormatGeekery #fileforensics

  3. Anyone in #fileforensics #forensics #digitalforensics willing to offer an informational interview?

    I'm trying to figure out whether that field would be a good fit for me.

    I have a substantial track record in open source communities and decent knowledge of file formats and some of the mayhem available. 😄

    #fedihire

  4. @decalage recently asked me if we had any files with Adobe LiveCycle's Usage Rights. I hadn't come across these before, but I think they'd have important implications for #fileforensics and #digipres

    Adobe's link: help.adobe.com/en_US/livecycle

    I opened issues.apache.org/jira/browse/ to track discussion of this.

    If you care about this topic or can offer technical advice on how to extract this info, please help!

    cc @PDFassociation

  5. I've gotten a bunch of #infosec followers over the last coupla days.

    For those interested in #fileforensics and especially PDFs, please take a look at our recently released 8 million/8TB PDF corpus, derived from #CommonCrawl and then augmented by our team at #nasajpl

    digitalcorpora.org/corpora/fil

  6. The info in the tables includes file types for embedded files, depth of embedded files, language id and a bunch of other features, including the #outOfVocabulary statistic.

    These kinds of stats are really important for ingest for #search, #digipres and #fileforensics.

    Take a look at the *-1k.csv files. If you're interested, I can share the config file that extracted that info.

  7. We recently published the results of running #ApacheTika on the corpus with an emphasis on PDF, erm, features.

    There are two tables: in (a), each row is a URL for the primary/container PDF; in (b), each row is either a primary/container PDF or an attachment within that PDF.

    #digipres #fileforensics

    downloads.digitalcorpora.org/c
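
To make the two-table layout from posts 6 and 7 concrete, here is a minimal Python sketch of how a table with one row per container or attachment could be reduced to a container-only view and summarized by embedded file type and depth. The column names (`container_url`, `embedded_path`, `depth`, `file_type`) and the sample rows are hypothetical and synthetic; the real CSV headers may well differ, so treat this as an illustration of the idea only.

```python
import csv
import io
from collections import Counter

# Synthetic rows for illustration only; not taken from the published tables.
# All column names here are assumptions, not the real CSV headers.
SAMPLE = """container_url,embedded_path,depth,file_type
https://example.com/a.pdf,,0,application/pdf
https://example.com/a.pdf,/attach1.xml,1,application/xml
https://example.com/b.pdf,,0,application/pdf
https://example.com/b.pdf,/inner.pdf,1,application/pdf
https://example.com/b.pdf,/inner.pdf/image0.jpg,2,image/jpeg
"""


def load_rows(text: str) -> list[dict]:
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def containers_only(rows: list[dict]) -> list[dict]:
    """Keep only depth-0 rows, i.e. the primary/container PDFs."""
    return [r for r in rows if int(r["depth"]) == 0]


def embedded_type_counts(rows: list[dict]) -> Counter:
    """Count file types among embedded files (depth greater than 0)."""
    return Counter(r["file_type"] for r in rows if int(r["depth"]) > 0)


def max_depth_per_container(rows: list[dict]) -> dict[str, int]:
    """Report the deepest embedding level seen for each container URL."""
    depths: dict[str, int] = {}
    for r in rows:
        url = r["container_url"]
        depths[url] = max(depths.get(url, 0), int(r["depth"]))
    return depths


if __name__ == "__main__":
    rows = load_rows(SAMPLE)
    print(len(containers_only(rows)), "containers")
    print(embedded_type_counts(rows))
    print(max_depth_per_container(rows))
```

For one of the real *-1k.csv files, `load_rows` would be replaced by opening the file with `csv.DictReader` and the column names adjusted to match its header row.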