home.social

#wdpd — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #wdpd, aggregated by home.social.

  1. Happy World Digital Preservation Day!

    Did you know that in 2024 and 2025, the University of Victoria and the University of Manitoba became the first Borealis institutions to have their collections certified as trustworthy data repositories by CoreTrustSeal?

    This certification means that these collections meet the highest preservation standards and best practices in data management, ensuring that the data will be discoverable and reusable over the long term.

    Over the past year, we have worked on developing a CoreTrustSeal documentation suite to help other Borealis institutions obtain certification.

    This documentation suite is now available on SPOTDocs, the Borealis wiki: oculsp.ca/ajqz1

    #WDPD #WDPD2025

  2. Happy World Digital Preservation Day!

    Did you know that in 2024 and 2025 the University of Victoria and the University of Manitoba became the first Borealis institutions to have their collections certified as Trustworthy Data Repositories by CoreTrustSeal?

    This certification means that these collections meet the highest standards of preservation and best practices in data management, ensuring the data will be discoverable and reusable for the long term.

    Over the past year, we have been working on developing a CoreTrustSeal Documentation Suite to support other Borealis institutions in pursuing certification.

    This documentation suite is now available on SPOTDocs, the Borealis wiki: oculsp.ca/ajqz1

    #WDPD #WDPD2025

  3. Happy World Digital Preservation Day!

    As the eresource landscape continues to shift, Scholars Portal is committed to supporting Canadian libraries with long-term preservation of, and perpetual access to, their resources, even after a vendor agreement has been terminated.

    Preservation rights for eresources hosted by SP include local load, perpetual access, and transformation rights, allowing us to migrate content to other formats as technology becomes obsolete, extending access long into the future.

    We receive independent copies of journal content from publishers, which we then host on servers at the University of Toronto Libraries and preserve in our Trustworthy Digital Repository (TDR).

    You can learn more about how SP supports the long-term preservation of eresources in our 2024-2025 Annual Report and in the Spotlight section of our September 2025 Newsletter, contributed by eresource staff at Ontario libraries.

    Annual Report: oculsp.ca/oqrc9
    Newsletter: oculsp.ca/459pm

    #WDPD #WDPD2025

  4. PRONOM’s dustiest records

    NB. Because of its complexity, this post may be easier to read in its original blog form than here on Mastodon: https://exponentialdecay.co.uk/blog/pronoms-dustiest-records/

    Tyler’s recent blog post for the PRONOM Hack-a-thon Week 2024 (see my previous post for this week) brought up an interesting point about two of PRONOM’s oldest outline records, Real Video Clip (fmt/204) and Real Video (x-fmt/277). How did they end up in PRONOM?

    Tyler suggests:

    I assume PRONOM originally added these based on MIME types available.

    I thought I knew the answer, and so it prompted a forensic look at the records to see if what I thought I knew aligned with reality!

    As a PRONOM maintainer at The National Archives, UK from 2009 to 2012, I knew a little of the system’s history. We see some of that history affect us today. For example, of the records that don’t have descriptions or file format signatures, 156 are so-called x-PUIDs: a mechanism in PRONOM for working on file formats internally without polluting the public record, and one that was never meant to make it into the wild. There are 455 x-PUIDs in total. They made it into the wild anyway (before my time), and so they exist as a symbol of PRONOM’s dustiest, oldest records.

    Even by the time I had started, PRONOM still had a lot of what we started to call outline records. One of the more positive changes we made to the process back in the day was that we would stop creating outline records; instead, we would focus on records that could be tied to signatures. This didn’t necessarily make the records more correctly aligned with reality, but it meant records had utility, and file formats identified by DROID could be tied back to something that PRONOM “knew about”. I believe the process is a bit more flexible these days, allowing individuals to contribute information to records that ties them back to information like MIME types and specifications. It’s clearer that the format is “real” even if a signature is yet to be developed (and of course there are a large number of data formats that are hard to even represent in traditional PRONOM signatures any more, and so they need a record even if there isn’t a neat concept of a signature for them).

    Okay, old man, but what about Tyler’s thesis?

    Stellent and PRONOM

    I learned sometime in my tenure at The National Archives that PRONOM had been seeded with a lot of the formats listed in a technology called OutsideIn previously owned by Stellent and now owned by Oracle.

    Oracle OutsideIn: https://docs.oracle.com/outsidein/853/oit/
    OutsideIn (2010): https://web.archive.org/web/20101016164937/http://www.oracle.com/technetwork/middleware/content-management/oit-all-085236.html
    Data sheet – Formats (2011): https://web.archive.org/web/20110125024733/http://www.oracle.com/technetwork/middleware/content-management/ds-oitfiles-133032.pdf
    COPTR entry: https://coptr.digipres.org/index.php/Oracle_Outside_In_Technology

    I had always had a feeling that the scope of this list was largely exaggerated by the company selling the software, as it is a marketing tool; and if not exaggerated, then perhaps just not as clearly delineated by format as PRONOM is, but rather by software, regardless of the properties of a given “format”, e.g. WinZip and PKZip.

    Back to the story, though: I was also reasonably sure I would find Tyler’s RealVideo formats in the format listing, but I did not!

    I downloaded a CSV summarizing the PRONOM records from api.pronom.ffdev.info with:

    curl -X 'GET' \
     'https://api.pronom.ffdev.info/pronom_summary_csv' \
     -H 'accept: application/csv'

    I filtered on outline entries and those without signatures only. I went through the entries still remaining and looked for name matches. I did find some name-for-name matches and some that were close, but no RealVideo or RealVideo Clip.
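
    The same filtering step could also be scripted once the CSV is saved locally. What follows is only a minimal sketch in Python: the column names (name, outline, has_signature) and the true/false encoding are hypothetical, not taken from the API, and should be checked against the real CSV header first. It also assumes the filter means "outline and no signature".

```python
import csv
import io

# Hypothetical column names and values: check the real CSV header first.
NAME_COL = "name"
OUTLINE_COL = "outline"
SIGNATURE_COL = "has_signature"


def is_true(value: str) -> bool:
    """Interpret a CSV cell as a boolean (assumed encoding: 'true'/'false')."""
    return value.strip().lower() == "true"


def outline_without_signature(csv_text: str) -> list[str]:
    """Return names of records flagged as outline that also lack a signature."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row[NAME_COL]
        for row in reader
        if is_true(row[OUTLINE_COL]) and not is_true(row[SIGNATURE_COL])
    ]


if __name__ == "__main__":
    # Made-up rows, purely to show the shape of the input.
    sample = (
        "name,outline,has_signature\n"
        "Example Format A,true,false\n"
        "Example Format B,false,true\n"
    )
    print(outline_without_signature(sample))
```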

    The matches:

    Format | Match
    7-bit ANSI Text | yes
    7-bit ASCII Text | yes
    8-bit ANSI Text | yes
    8-bit ASCII Text | yes
    EBCDIC-US | yes
    Framework Database III | yes
    IBM DisplayWrite Document 2 | yes
    IBM DisplayWrite Document 3 | yes
    Micrografx Designer 3.1 | yes
    Nota Bene Text File | yes
    Unicode Text File | yes

    The maybes:

    Format | Match
    Cascading Style Sheet | maybe
    Freelance File 1.0-2.1 | maybe
    MacPaint Graphics | maybe
    Microsoft Office Binder File for Windows 95 | maybe
    Microsoft Works Database | maybe
    Microsoft Works Database for DOS 2.0 | maybe
    Microsoft Works Database for Windows 3.0 | maybe
    Microsoft Works Database for Windows 4.0 | maybe
    Professional Write Text File | maybe
    WordPerfect for Windows Document 5.2 | maybe
    XYWrite Document | maybe
    XYWrite Document III | maybe
    XYWrite Document III+ | maybe

    11 exact matches! It’s hardly a headline!

    I had hoped that if I found more exact matches it would provide some clues to where some of the older PRONOM entries came from. I expected most of the outline records to come from this list; alas, it isn’t nearly as many as anticipated.

    I hoped too that going through the list I might get more clues as to formats that could potentially be deprecated in PRONOM.

    As it stands, from the OutsideIn list, the only records I would personally recommend for deprecation are:

    7-bit ANSI Text
    7-bit ASCII Text
    8-bit ANSI Text
    8-bit ASCII Text
    EBCDIC-US
    Unicode Text File

    We know enough now to be almost certain that if something that looks like these files arrives in the archive it will present as a standard text file, and that we will need to rely on determining the character encoding using tools such as Richard Lehane’s characterize (see characterize’s README for more background). It is unlikely we will be able to attach a signature to these records, and we know there are a great deal more encodings in the world than need be represented as PRONOM identifiers.

    NB. This might be something to formalize in a PRONOM decision-making rubric, and to connect to formalizing approaches for XML-based signatures.

    A bit of a let down, or is it?

    Still uncomfortable with so many outline records and little provenance for them, I wanted to find more information about the source of PRONOM data and so I decided to take a different path — I surfed the internet for answers!

    Out of the list of outline records I found a few to be overly specific, or slightly weird, i.e. not really things we hear much about day-to-day, some examples:

    ACBM Graphics
    Apple Sound
    AutoCAD Plot Configuration File 1.0-R13
    AutoCAD Plot Configuration File R14
    AutoSketch Drawing
    Btrieve Database 5.1
    CorelDraw Pattern
    DEC Data Exchange File
    DEC WPS Plus Document
    Dr Halo Bitmap
    Generic Library File
    HTML Extension File
    Hewlett Packard AdvanceWrite Text File
    Inkwriter/Notetaker Template
    Inset Systems Bitmap
    Instalit Script
    Interleaf Document
    Microsoft Excel Add-In
    Microsoft Excel ODBC Query
    Microsoft Excel Toolbar
    Microsoft Powerpoint Design Template
    Microsoft Print File
    Microstation CAD Drawing 95
    NAP Metafile
    Nota Bene Text File
    OS/2 Change Control File
    Revit External Group
    SAP Document
    SAS Data File
    Scanstudio 16-Colour Bitmap
    Schedule+ Contacts
    Speller Custom Dictionary
    Unisys (Sperry) System Data File
    Wordperfect Secondary File 5.0
    Wordperfect Secondary File 5.1/5.2
    form*Z Project File

    ACBM graphics? Dr Halo Bitmap? Btrieve database, “5.1”? Where are the other five?!

    It gave me pause. I didn’t believe these were all formats well known to the folks who created PRONOM, and I know we didn’t have a digital transfer program at the time advanced enough for agencies to be submitting a huge variety of formats to PRONOM for future preservation.

    I felt they had to come from somewhere, but where?

    Enter Filext.com

    Because these formats were very specific I found listings on the internet that I knew had to be part of the story. I had immediate luck just looking for combinations of these names, e.g. ACBM Graphics + NAP Metafile.

    In particular I found listings on different websites from hobbyists or universities that all looked the same or similar.

    There were definite matches with PRONOM which we will get to, but I started to wonder about the provenance of these extensions.

    I kept looking and I found one clue, a header and footer of a file that looked like those above and read as follows:

    Copyright © 2002 Computer Knowledge
    All Rights Reserved
    This download for personal use only. Do NOT distribute
    it to others either alone or incorporated into any
    software without prior permission from Computer Knowledge.
    Developers who wish to incorporate portions of the list
    please see the comments at the end of this file.

    Developer permissions...

    This total file may not be included in any other software or
    project which presents the data to the public or portions of
    the public. Any developer who wishes to include up to (but
    not more than) 2,000 individual entries from this file is free
    to do so provided certain conditions are met. These are:

      1) Credit must be given to FILExt. If links are available
      in the developed product then one must also be provided to
      FILExt as http://filext.com.

      Suggested text: "File extension list courtesy of FILExt.
      For a more extensive list visit http://filext.com."

      2) Once the extensions are chosen for one product by any
      developer then these same extensions must continue to be
      used by that developer for any other projects (i.e., you
      cannot take one set of 2,000 for one project and a different
      set of 2,000 for another project; it's a total of 2,000).

      3) If links are available in the developed product then any
      links appearing associated with any of the 2,000 picked
      extensions must be included in the product. (This covers
      future plans to include such links in this list.)

    When the project is complete please notify FILExt with the
    specifics at [email protected]. We're always interested
    in how the list is being used. Thank you.

    Filext.com!

    And so I asked myself, how long had filext been around?

    As it turns out, quite a while! It was forked from a site called cknow around 2002. cknow.com was registered around 1996 and filext.com was registered in 2001.

    The first appearance of cknow in the Internet Archive is late 1996: https://web.archive.org/web/19961219035827/http://www.cknow.com/ and of Filext, early 2001: https://web.archive.org/web/20010522235126/http://www.filext.com/

    The sites were founded by Tom Simondi. It looks like he has been responsible for a lot of the 90s and 00s work around demystifying extensions and getting more information to folk about what to do with them.

    Could it be the source of the first PRONOM records?

    Comparing some of the many other text-based lists I had found with cknow and filext gave me some confidence that they shared some heritage, and so I asked, could the cknow and filext lists have also seeded PRONOM?

    I picked a list from close to 2002, when PRONOM was first started (cknow Extensions: 2000), and began to compare entries for exact matches.

    Format | Match
    ACBM Graphics | yes
    AutoCAD Compiled Menu | yes
    AutoSketch Drawing | yes
    Btrieve Database 5.1 | yes
    DataFlex Query Tag Name | yes
    Deluxe Paint bitmap | yes
    DesignCAD Drawing | yes
    Digital Video | yes
    Dr Halo Bitmap | yes
    Frame Vector Metafile | yes
    Framework Database II | yes
    Framework Database III | yes
    Framework Database IV | yes
    Information or Setup File | yes
    Inset Systems Bitmap | yes
    InterBase Database | yes
    Lotus Approach View File | yes
    Mathematica Notebook | yes
    Microsoft Excel Add-In | yes
    Microsoft Excel ODBC Query | yes
    Microsoft Excel OLAP Query | yes
    Microsoft Excel OLE DB Query | yes
    Microsoft Excel Web Query | yes
    Microsoft FoxPro Library | yes
    Microsoft Outlook Address Book | yes
    Microsoft PowerPoint Graphics File | yes
    Microsoft Powerpoint Add-In | yes
    Microsoft Visual FoxPro Table | yes
    Microsoft Works Database | yes
    Microsoft Works Document | yes
    Microstation CAD Drawing 95 | yes
    NAP Metafile | yes
    Nota Bene Text File | yes
    OS/2 Change Control File | yes
    PICS Animation | yes
    PageMaker Document 3.0 | yes
    PageMaker Time Stamp File 4.0 | yes
    Professional Write Text File | yes
    Quicken Data File | yes
    RealVideo Clip (cc. Tyler!) | yes
    Schedule+ Contacts | yes
    StatGraphics Data File | yes
    Structured Query Language Data | yes
    Ventura Publisher Vector Graphics | yes
    XYWrite Document III | yes
    XYWrite Document IV | yes

    46 matches!

    Format | Match
    Apple Sound | maybe
    AutoCAD Device-Independent Binary Plotter File | maybe
    AutoCAD Drawing Template | maybe
    Cascading Style Sheet | maybe
    DEC Data Exchange File | maybe
    DEC WPS Plus Document | maybe
    Freelance File 1.0-2.1 | maybe
    Java Servlet Page | maybe
    Micrografx Designer 3.1 | maybe
    Microsoft Office Binder File for Windows 95 | maybe
    Microsoft Office Binder Template for Windows 95 | maybe
    Microsoft Office Binder Template for Windows 97-2003 | maybe
    Microsoft Office Binder Wizard for Windows 95 | maybe
    Microsoft Office Binder Wizard for Windows 97-2003 | maybe
    Ventura Publisher | maybe
    XYWrite Document | maybe
    XYWrite Document III+ | maybe

    17 maybes!
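
    The comparison above was done by eye. A scripted approximation might look like the Python sketch below, which is not the method used for the tables: "yes" is taken to mean a case-insensitive exact match, "maybe" is approximated with difflib's similarity ratio, and the 0.75 cutoff is an arbitrary assumption that would need tuning against real lists.

```python
import difflib


def normalise(name: str) -> str:
    """Lower-case and collapse whitespace so trivial differences don't matter."""
    return " ".join(name.lower().split())


def compare_names(pronom_names, list_names, cutoff=0.75):
    """Split PRONOM names into exact matches, near matches and the rest.

    `cutoff` is an assumed similarity threshold for the 'maybe' bucket.
    """
    candidates = sorted({normalise(n) for n in list_names})
    exact, maybe, unmatched = [], [], []
    for name in pronom_names:
        key = normalise(name)
        if key in candidates:
            exact.append(name)
        elif difflib.get_close_matches(key, candidates, n=1, cutoff=cutoff):
            maybe.append(name)
        else:
            unmatched.append(name)
    return exact, maybe, unmatched


if __name__ == "__main__":
    # The second list is made up for illustration; it is not the cknow list.
    pronom = ["ACBM Graphics", "Apple Sound", "Quicken Data File"]
    extension_list = ["ACBM graphics", "Apple Sound File"]
    print(compare_names(pronom, extension_list))
```

    A threshold like this will produce false positives and negatives, so the "maybe" bucket still needs a human to review it.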

    What did we answer?

    Okay, 46 exact matches does not the full listing make (although many (now) full entries may still have been made from these early listings). Filext may have been an important resource for the first PRONOM records, but it’s also likely that PRONOM had other sources of information. For example, a number of the Microsoft formats with outline records read like export or save-as listings in previous versions of Microsoft software, e.g. Excel.

    NB. I wasn’t actively researching this side of things writing this blog, but I can already see some commonalities, especially Unicode Text!

    I know we also had a copy of the Dr Dobb’s Essential Books on File Formats CD-ROM in the archive, and so that may also have been an important resource when PRONOM was creating its first records.

    I count only two overlaps with the Stellent list, Framework Database III and Nota Bene Text File.

    We did, however, find the RealVideo Clip! And I think we found some decent correlation with a resource that looks likely to have been used partially to populate the PRONOM database.

    The era of file extensions

    Throughout my research, I found a lot of similar websites. Filext seems to go furthest back and has the greater pedigree, but in the noughties a lot of other sites appeared, trying to provide similar information to internet users. A few of them seemed comprehensive and particularly well presented.

    I am sure we looked at these sites during my time on PRONOM, although with less frequency given the need to reduce outline records and increase the number with actionable information.

    NB. I also learned that TrID has been around since 2003! https://web.archive.org/web/20030612031252/http://mark0.ngi.it:80/

    Provenance and prior art

    It’s not entirely productive to say I wish we had better provenance for PRONOM records back in the day – but I do!

    It makes me reflect on the importance of looking outside of our own walls in digital preservation instead of the constant redundancy of reinvention or ownership.

    Often as academics, or those with archival views of the world, we can provide a polish and precision to technology as it exists to make it more usable in an archival context.

    But cknow has been around so long, and the Unix utility File was created in 1986.

    There’s a parallel history here that we should be recognizing and sharing for our next colleagues.

    I arrived at TNA in 2009 and learned about File maybe two years later. As a Windows guy at the time, that might not be uncommon, but I do feel it is on me to have known more. I also think it should have been trivial to access the provenance around some of the records in the database at the time, but more than that – as a field, shouldn’t we all know Tom Simondi? What if the same academic rigour of PRONOM and DROID could have been applied to existing tools like File? What if we had expanded our bubble and recognized digital preservation (or the tools for it) is something people have been doing in all but name for the longest time? What if the people working in parallel on these projects and websites were part of the digital preservation inner-circle community today?

    I don’t have answers, but I feel there are lessons there for the future. Not reinventing or rebuilding without good reason is important, but even if we build something new and we have been inspired by something else, continuing to recognize and acknowledge prior art is important.

    What do you think?

    Also, how do we get these people into a room to celebrate their work and learn more?

    What next?

    I don’t think I got very far here but I found it interesting, and I hope other readers may as well.

    This is meant to be a PRONOM hack-a-thon blog and I don’t know if I have pushed the sticks forward that much but maybe there’s a bit more to reason about in the outline records, for example, around the plain-text formats mentioned above and a few more identified along the way.

    Format | PUID | Recommendation
    7-bit ANSI Text | x-fmt/21 | Recommend deprecation
    7-bit ASCII Text | x-fmt/22 | Recommend deprecation
    8-bit ANSI Text | x-fmt/282 | Recommend deprecation
    8-bit ASCII Text | x-fmt/283 | Recommend deprecation
    Unicode Text File | x-fmt/16 | Recommend deprecation
    EBCDIC-US | fmt/159 | Recommend deprecation
    MS-DOS Text File with line breaks | x-fmt/130 | Recommend deprecation

    I noticed some low-hanging fruit in the outline entries that I might focus on at the next opportunity, if someone else doesn’t get there first. These would be:

    Format | PUID | Note | Signature
    Cascading Style Sheet | x-fmt/224 | Consider adding CSS to the record name | A signature should be feasible
    Document Type Definition | x-fmt/315 | Consider adding DTD to the record name | A signature should be feasible
    Extensible Stylesheet Language | x-fmt/281 | Consider adding XSL to the record name | A signature should be feasible
    HTML Extension File | x-fmt/417 | Related to Microsoft’s IIS server | A signature may be possible
    Standard Generalized Markup Language | x-fmt/195 | Consider adding SGML to the record name | A signature may be possible
    Still Picture Interchange File Format 2.0 | fmt/113 | Related to JPEG | A signature should be possible
    Structured Query Language Data | fmt/206 | Consider adding SQL to the record name | A signature may be possible
    Dreamweaver Lock File | fmt/335 | A system file, there may be an entry in the NSRL database | A signature may be possible

    A little more on the history of extensions websites

    The complete filext text file (allext.zip)

    It took a few jumps, but I found the complete downloadable text file from Filext.com. I don’t think it exists any more and I don’t think the Internet Archive managed to grab a copy. Apparently it was quite a chunk of data to download on the web once upon a time, but they eventually found a way to release a zipped text file:

    Via one jump we get to the “whole list” page:

    https://web.archive.org/web/20020605164206/http://filext.com/wholelist.htm

    And then to confirm our absolute interest in downloading it, we get to the a2z file:

    https://web.archive.org/web/20020606071418/http://filext.com/a2z.htm

    That would have taken us to the zip file which, alas, was never captured by the Internet Archive. Maybe it is on other Memento-compatible servers:

    https://web.archive.org/web/20060117000000*/http://www.filext.com:80/allext.zip

    Keeping filext up to date

    Filext still asks for registry data to help keep it up to date. That’s pretty cool!

    https://filext.com/faq/gather_data_for_filext.html

    Echo OFF
    CLS
    assoc > filext_submission_output.txt
    Echo ---------- >> filext_submission_output.txt
    ftype >> filext_submission_output.txt
    Echo Thank you. The output file has been created and
    Echo named filext_submission_output.txt and it should
    Echo be in the same place where you saved this batch
    Echo file. All that is left now is to send that file
    Echo to FILExt. Attach it to an E-mail sent to the
    Echo address: [email protected]
    Echo The E-mail subject should be: Submission
    Echo Thank you.
    Pause
    Exit

    Filext as a source of learning

    The filext faqs and community seemed particularly helpful and interesting back in the day:

    https://web.archive.org/web/20090322040812/http://filext.com/faq/

    File extension aggregator

    The file-extension.net website started an aggregator project around 2007 and it’s still running today!

    http://file-extension.net/seeker/

    Some bonus images…

    As I was working on this, I found irony in Google Sheets glitching, and I managed to grab some screenshots along the way. Thanks for reading, everyone!

    #digipres #DigitalPreservation #DROID #FileFormat #FileFormats #PRONOM #WDPD #WDPD2024

  5. simpledroid: completing the circle

    It’s nearing the end of 2024 and that must mean a PRONOM hackathon as part of the World Digital Preservation Day (#WDPD2024).

    My contribution is a follow-up on my work earlier in the year to produce a valid DROID signature file from Wikidata in wddroidy.

    simpledroid is available on GitHub and builds a simple DROID signature file from PRONOM itself. It offers a scripted pathway to a signature file that uses official PRONOM data but doesn’t require the current PRONOM database and its legacy stored procedures.

    It also does away with a lot of the excess data in the current DROID signature file, which was previously an optimization for its Boyer-Moore-Horspool search algorithm, as described by Matthew Palmer.

    The primary reason for simpledroid was to complete the circle on my previous efforts and to prove that it was possible to create a simplified signature file and for it to work with DROID. The result is about 80-90% there, with only a few skeleton files that remain unidentified – it should only require a small amount of forensic research to determine the reason.

    The output provides a way of simplifying the signature file generation process, offering new opportunities to create alternative versions or to filter what’s already there, e.g. filtering out any signatures that aren’t explicitly for image identification in a digitization workflow.

    It may provide another way into PRONOM data for those who might look at DROID first as well as opening up different ways to modify and test signatures.

    The reference output shows that the signatures are much easier to understand in this simplified DROID file.

    simpledroid outputs a file with a smaller footprint than the current file:

    1.2M DROID_SignatureFile_Simple_2024-11-11T12-29-22Z.xml
    3.4M DROID_SignatureFile_V118.xml

    It also contains all of the file classification data e.g. FormatType="Video" from PRONOM that will be added into DROID in a future release (and is already available in Siegfried).

    Unlike the wddroidy work, priorities have also been added to the signature file, so the mechanics of the signature file are pretty close to the official version. (DROID uses the signature sequence and offsets to identify a file, but it then uses a priority to determine which results to display to the user where there may otherwise be positive matches for formats that provide the foundation for another, e.g. how XML forms the basis of SVG or XHTML.)

    It might be possible to remove some data around minimum and maximum offsets in the new file, after discovering that DROID’s simplified syntax requires curly bracket syntax at the beginning and end of sequences to mimic the same behavior, e.g.

    With a BOFoffset, min_offset = 2, and signature = BADF00D1, the signature needs to become {2}BADF00D1 to work.
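
    As a rough illustration of that rewrite, here is a minimal Python sketch. The function name and parameters are hypothetical and are not taken from simpledroid. The BOF case follows the example above; placing the gap after the sequence for an EOF offset is my assumption, based on the "beginning and end of sequences" remark, and maximum offsets are not handled.

```python
def with_offset(signature: str, min_offset: int = 0, anchor: str = "BOF") -> str:
    """Fold a minimum offset into the sequence using curly bracket syntax.

    BOF offsets become a leading {n}; EOF offsets (an assumption here)
    become a trailing {n}. A zero offset leaves the sequence unchanged.
    """
    if min_offset < 0:
        raise ValueError("min_offset must not be negative")
    if anchor not in ("BOF", "EOF"):
        raise ValueError("anchor must be 'BOF' or 'EOF'")
    if min_offset == 0:
        return signature
    gap = f"{{{min_offset}}}"  # e.g. 2 -> "{2}"
    return gap + signature if anchor == "BOF" else signature + gap


if __name__ == "__main__":
    print(with_offset("BADF00D1", min_offset=2))  # {2}BADF00D1
```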

    The code is pretty straightforward and uses a few tricks to output XML sensibly without having to build the document’s tree (DOM) in a more verbose way. There are probably a few other shortcuts I’d fix with time if the code was ever useful, including improving variable naming and adding tests.

    I’m not sure this code will ever be needed, or used by anyone, but for a quick hack and a quick proof of concept, it felt good to put it out there. Maybe someone will look at this or the wddroidy work and see there may be a way to federate different sources of signature information together into something DROID can use. Or it might be a useful demonstration to the DROID team that allows them to simplify PRONOM’s database and output mechanisms in a way that remains compatible with existing tools.

    Previous research week work

    My previous work for PRONOM research week includes a dashboard and API for getting more information out of PRONOM, including listings of those records still requiring descriptions or signatures. You may find that work interesting and it is available at https://pronom.ffdev.info and https://api.pronom.ffdev.info.

    And if you want to get in on the signature development work, signature development utility 2.0 (https://ffdev.info) was also a previous effort of mine for research week 2020 and will hopefully also benefit from outputting DROID’s simplified syntax.

    A week of file formats

    Of course with World Digital Preservation Day, file formats were pretty popular.

    Andrew Jackson attempted to calculate how many distinct formats might be out there using methods used to calculate ecological diversity.

    Amanda Tome described the scope of their work and shared a number of useful resources, including links to the PRONOM starter pack and to the PRONOM drop-in sessions.

    You might also find out a bit more about yourself by playing this File Format Dating Game from Lotte Wijsman and colleagues: Susanne van den Eijkel, Anton van Es, Elaine Murray, Francesca Mackenzie, Ellie O’Leary, and Sharon McMeekin. (I ended up on a date with FASTA (FDD000622) in my first play-through!)

    Not specifically for WDPD, but in the same week I also enjoyed this presentation from Ange Albertini looking at different ways of identifying file formats. One big take away for me was thinking about how to get more forensic information out of a file format identification. DROID doesn’t tell us a lot, but is there a world in which one day it could?

    Let me know if you find any of this work useful at all; and good luck on your file format endeavors this week.

    #digipres #DigitalPreservation #DROID #FileFormats #PRONOM #Python #siegfried #SkeletonTestCorpus #WDPD #WDPD2024

  6. I got a response to my paper PREMIS Events Through an Event-Source Lens.

    There are two strange choices made by this response. I’ll touch on the more personal one at the end, but first, what does the response say?

    It’s not entirely clear. 

    If the response says that “it is a choice to implement PREMIS”, that “PREMIS can be implemented in different ways”, and that “it’s technology agnostic”, then yes, 100%. That’s basically the driver for my original paper, and once you read it holistically, instead of dissecting it and cherry-picking points, you will probably read it that way as well.

    As I wrote in my first blog response to the publication of my paper in 2023, a 2013 presentation by Tessella’s Rob Sharpe was an important reference point for me, and we’ll revisit it below. Rob labors the point that PREMIS is technology agnostic and can be represented in other formats. Since 2013 I haven’t seen enough conversation or discussion about that, and I wanted to amplify that message by looking at PREMIS in an event-sourced model as an aggregation.

    If there’s something more substantive in the PREMIS Editorial Committee’s (EC) response, then I feel it’s lost in its own stylistic choices (focusing on what I might have been saying rather than taking a show-don’t-tell approach to clarifying their more salient points).

    I wonder if it might have been handled differently? I am pretty easy to find these days, and so reaching out to clarify any of my thinking might have been one way. Perhaps there was a way to collaborate on a response. Perhaps most of the EC’s concerns (if there are any) could have been handled with a joint editorial note in the original paper, clarifying that my words are not an authoritative source on PREMIS; rather, PREMIS (events) were largely a vehicle for describing the benefits of an event-sourced architecture, and you still need to consider and interpret the PREMIS documentation and guidance for yourself before implementing it in your own solutions.

    Going a different direction

    The essence of the original paper is this: (from my perspective) PREMIS is not a schema to be implemented in the back-end of any digital preservation system. Should it still be deemed a relevant technology, it might be studied in your requirements analysis, and you would make sure that your own system is not lossy in any way that would affect PREMIS “conformance”. But you would not match your “schema” to PREMIS; you would ensure that you can output it, “present it” that is. It would become one representation, out of many, of the data that can be generated from your system: one view or, as I clearly point out, an aggregation, in the case where we have chosen an event-based architecture.

    This is not at odds with the (so-called) corrections that have been provided to me in the Code4Lib journal article from the PREMIS EC.

    That being said, a further thesis is that PREMIS events are often a lossy, stateful representation of data in a digital preservation system. PREMIS represents one-dimensional state (or slices of state) over a period of time. In the modern engineering world, we have at our disposal methods of capturing, greedily, all events in the life of a digital object and doing that will create a richer view of the life of that object, and, as a representation of that data, a richer PREMIS view of an object and its events over time if so desired.
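
    To make the idea of PREMIS as a projection concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the implementation from the paper: StoredEvent, the internal event kinds, and the mapping table are all hypothetical, and the output keys only borrow the names of PREMIS semantic units rather than producing a conformant PREMIS record.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class StoredEvent:
    """One immutable entry in a hypothetical append-only event log."""
    object_id: str
    kind: str
    occurred_at: datetime
    detail: str = ""


# Hypothetical mapping from internal event kinds to PREMIS-style event types.
PREMIS_TYPES = {
    "checksum_verified": "fixity check",
    "format_identified": "format identification",
}


def premis_view(log, object_id):
    """Project the full log into a PREMIS-like list of events for one object.

    Kinds with no mapping are simply absent from the view, which is why the
    view is lossy while the log itself keeps everything.
    """
    view = []
    for event in sorted(log, key=lambda e: e.occurred_at):
        if event.object_id != object_id or event.kind not in PREMIS_TYPES:
            continue
        view.append({
            "eventType": PREMIS_TYPES[event.kind],
            "eventDateTime": event.occurred_at.isoformat(),
            "eventDetail": event.detail,
            "linkingObjectIdentifier": event.object_id,
        })
    return view


if __name__ == "__main__":
    when = datetime(2024, 11, 7, 9, 0, tzinfo=timezone.utc)
    log = [StoredEvent("obj-1", "checksum_verified", when, "example detail")]
    print(premis_view(log, "obj-1"))
```

    The log remains the source of truth; the PREMIS-like view is regenerated on demand and can be changed without touching what was recorded.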

    The authors of the EC response labor heavily on their perception of a misunderstanding on my part about PREMIS and they can choose to do that but what may look like a misunderstanding of PREMIS is not a misunderstanding of technology:

    Conformance, in general, is defined as:

    > how well something, such as a product, service or a system, meets a specified standard

    And the PREMIS EC have decided to attach levels to conformance (also graduated levels, and degrees) to “quantify(ing) the degree to which PREMIS has been implemented”, three of which are anchored in implementation, apparently, three distinct implementations.

    1. Mapping, indirect or otherwise,
    2. Export,
    3. Direct implementation,

    I write:

    PREMIS conformance should be separate from representation. If we acknowledge PREMIS is at least one important representation of preservation metadata, i.e. for its ability to act as an interface to those looking to interpret preservation metadata, then whether it exists logically on disk, or is generated through an event sourced projection, is irrelevant. How a representation complies with the PREMIS data model remains of greater importance, but this is measured from the same eventual view, whatever intermediate abstraction it sits within.

    The PREMIS EC can choose to have three graduated levels of implementation to quantify degree of implementation. They can also make it clear that level three (internal representation) is not necessarily the final goal, but that it might benefit you. But if you’re not the PREMIS EC, don’t go near it; there’s no need.

    I posit that conformance is only how well you can map to PREMIS or access something PREMIS-like that satisfies its data model. Your goal is to look at PREMIS as one interface you can potentially satisfy (you still need to describe objects uniquely; you need to describe agents engaging with them; rights need to sit somewhere), and once you can satisfy that interface you can access it in many different ways, and conformance should be measured against that, if PREMIS conformance is deemed valuable.

    Put simply, conformance does not require levels. Levels may simply be the wrong word; these are just guides you might follow to demonstrate conformance (or ways that someone might audit a system to determine conformance).
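    To illustrate measuring conformance at the interface rather than at the storage layer, here is a small sketch under stated assumptions. The `REQUIRED` tuple is an assumed minimal subset of event semantic units chosen for the example, not the full conformance requirements (read the data dictionary for those), and `stored`, `internal_log` and `projected` are hypothetical. The same check is applied to records held literally and to records generated on the fly.

```python
# Assumed minimal subset of event fields used for this illustration only.
REQUIRED = ("eventIdentifier", "eventType", "eventDateTime")


def missing_fields(record):
    """Return the required field names that are absent or empty in a record."""
    return [name for name in REQUIRED if not record.get(name)]


def conforms(records):
    """True if every record in the view carries all required fields."""
    return all(not missing_fields(r) for r in records)


# Case 1: records that exist literally "on disk" in a PREMIS-shaped store.
stored = [
    {
        "eventIdentifier": "evt-1",
        "eventType": "fixity check",
        "eventDateTime": "2025-11-06T09:00:00+00:00",
    }
]

# Case 2: an internal log with its own shape, projected when someone asks.
internal_log = [{"kind": "fixity check", "at": "2025-11-06T09:00:00+00:00"}]


def projected(log):
    """Generate PREMIS-like records from the internal log on demand."""
    for number, entry in enumerate(log, start=1):
        yield {
            "eventIdentifier": f"evt-{number}",
            "eventType": entry["kind"],
            "eventDateTime": entry["at"],
        }


stored_ok = conforms(stored)
projected_ok = conforms(projected(internal_log))
```

    The check never asks where the records came from. Both sources are judged on the same eventual view, which is the point being made above.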

    The EC clipped this from one of the points they responded to:

    Is level three (internal implementation) reasonable in today’s software development world, is it reasonable in today’s environmental climate?

    Do we sacrifice the potential to store and access other different, richer, more-complex, (or less-complex), representations about other cross-sections of our data at the expense of putting PREMIS at the core of our digital preservation system? – No. We can make it an output of many, and use its schema and data dictionary to output it, but we don’t build around it, we essentially report around it. 

    They argue: 

    there are also benefits in choosing to take an internationally defined and agreed data model and use that as the basis of your system. 

    Well, if it’s internationally defined and agreed, let’s just do that! 🤷

    The benefits of not implementing an external data model are broadly around increased control and flexibility, however the trade-off to consider is the likely loss of easy interoperability and exchange with other systems.

    If you re-frame PREMIS as an interchange-format and you can prove that as useful, you absolutely have my buy-in and I will have designed you a system that doesn’t preclude a PREMIS-like output, i.e. a way of aggregating more detailed information in your system and outputting PREMIS as a representation (a format) for others to understand.

    The resurgence of OAIS?

    From the EC: 

    There are two responses to this, the first is to note that access has always been considered a part of Digital Preservation, to the point that one of the functional areas of the OAIS model is Access.

    Who had OAIS on their World Digital Preservation Day (WDPD) Bingo Card? 

    But also, no. This is a misleading read and deserves more context.

    Access when it is considered part of digital preservation is when access is used as a measure of success of digital preservation (or indicator of the potential obsolescence of an object) – it is an intrinsic property of digital preservation. 

    But the access function in OAIS is not that. And even if you’re crafty, and build an access component to a system that provides a feedback loop to digital preservation functions, it’s not that part of OAIS.

    Now, PREMIS does have some nice features that support access BUT we’re talking “events”, and information that supports digital preservation and even though there may be a way to encode events that provide a feedback loop to measure the success of preservation, e.g. {“event”: “access”, “detail”: “tried to open PSD in GIMP”, “outcome”: “FAIL”}, true access goes well beyond the scope of my article and the spirit in which it was written.
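    For what that feedback loop might look like, here is a hedged sketch that reads access events shaped like the example above and counts failed attempts per object. The `object` key, the file names, and the idea of using failure counts as a review signal are assumptions added for illustration; they are not part of PREMIS or of the original paper.

```python
import json
from collections import Counter

# Illustrative event lines; only the first mirrors the example in the text,
# with an assumed "object" key added so failures can be grouped.
raw_events = [
    '{"object": "poster.psd", "event": "access", "detail": "tried to open PSD in GIMP", "outcome": "FAIL"}',
    '{"object": "poster.psd", "event": "access", "detail": "tried to open PSD in Krita", "outcome": "FAIL"}',
    '{"object": "report.pdf", "event": "access", "detail": "opened in browser", "outcome": "OK"}',
    '{"object": "report.pdf", "event": "fixity", "detail": "sha256 compared", "outcome": "OK"}',
]


def failed_access_counts(lines):
    """Count failed access attempts per object from JSON event lines."""
    counts = Counter()
    for line in lines:
        event = json.loads(line)
        if event["event"] == "access" and event["outcome"] == "FAIL":
            counts[event["object"]] += 1
    return counts


failures = failed_access_counts(raw_events)
# Objects with repeated failures could be queued for a preservation review.
needs_review = sorted(name for name, count in failures.items() if count >= 2)
```

    This only measures whether preservation is holding up; it says nothing about delivering content to users, which is the broader sense of access.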

    We need to evolve

    The EC presents a somewhat dogmatic and institutionalised response. As a flaneur in the field, as someone who has worked implementing PREMIS in one of the most PREMIS-heavy digital preservation systems out there, and who has also been involved in efforts to minimise PREMIS verbosity, including my own event-like approaches, I revisit Sharpe’s paper in 2022/2023. I do this asking, why don’t we talk about it more? Why do I see projects today that still treat XML as the end goal of PREMIS?

    My view is that a 20-year-old standard, a 2015 specification (last revision), a 2016 reference implementation in an out-of-date technology (XML), and a very institutional PREMIS EC, with roots at the Library of Congress, all have influence, and some of the points I do see appearing from their response are being buried in their desire to hold onto authority.

    The biggest point being buried, technological agnosticism, appears in the EC’s response to me five times, technology independent once, and in the official data dictionary once (unrelated), and it appears in the official 2015 conformance statement zero times (although you can bend the verbosity of the conformance statement into words that read like technologically agnostic). But make it explicit; don’t write it five times to me and not put it in the docs. Make new reference implementations, or borrow them from your implementers. Use plain language, and just make it explicit.

    Better still, let’s evolve the presentation of the PREMIS standard (away from separate PDFs), and use a modern documentation framework (e.g. Diataxis), and put it into public versioned source control, and give us a way that we can help write the documentation with you to make things like this clearer.

    While the EC’s response to me labors on the idea that I have missed the fact that PREMIS is technology agnostic, I wrote the original paper to amplify previous conversations and keep them relevant because they were formative for me, and I hope that they will be formative for others.

    I also wrote the original paper as more of a technology paper than a PREMIS paper (honouring PREMIS of course) but I make a very clear conclusion that is very much inclusive of PREMIS:

    It is this paper’s assertion that we can store more, and “do more” by taking an event-sourced approach to storing events associated with the “objects” described in the PREMIS data dictionary. 

    I can nuance this further:

    • Store events about your digital objects and try to make sure some of those events can be aligned with PREMIS, 
    • Store events because events happen on a continuum, don’t fall into the trap of storing state,
    • Create representations of your data, PREMIS might be one, access reports and logs might be another, feature analyses might be another, don’t limit yourself to one schema, use many. 
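    The three points above can be sketched together, assuming a plain list of event dictionaries as the only thing that is stored. The event shapes, names and values below are hypothetical. State is never saved; it is folded from the events each time it is needed, and any number of representations can be registered over the same log.

```python
from functools import reduce

# The only stored data: an ordered log of events (illustrative values).
events = [
    {"seq": 1, "object": "a.tif", "type": "ingested", "data": {"size": 1024}},
    {"seq": 2, "object": "a.tif", "type": "format_identified", "data": {"format": "image/tiff"}},
    {"seq": 3, "object": "a.tif", "type": "migrated", "data": {"format": "image/png"}},
    {"seq": 4, "object": "a.tif", "type": "accessed", "data": {"outcome": "OK"}},
]


def events_for(log, obj):
    """Return one object's events in the order they happened."""
    return sorted((e for e in log if e["object"] == obj), key=lambda e: e["seq"])


def apply_event(state, event):
    """Fold one event into a snapshot; later events overwrite earlier values."""
    return {**state, **event["data"], "last_event": event["type"]}


def current_state(log, obj):
    """Derive the present state from the log instead of storing it."""
    return reduce(apply_event, events_for(log, obj), {})


def format_history(log, obj):
    """A second representation: every format the object has had, in order."""
    return [e["data"]["format"] for e in events_for(log, obj) if "format" in e["data"]]


# One log, many views; a PREMIS-like projection would be one more entry here.
REPRESENTATIONS = {
    "state": current_state,
    "format_history": format_history,
}

state = REPRESENTATIONS["state"](events, "a.tif")
history = REPRESENTATIONS["format_history"](events, "a.tif")
```

    A stored snapshot would only say the object is `image/png` now. The log also remembers that it used to be `image/tiff`, which is the difference between storing state and storing events.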

    My paper is about trying to fit older trusted paradigms into modern development practices. It’s about moving away from dogmatic adherence to the past while honouring something that exists. 

    We can do PREMIS exactly the same as we do it now, as long as we don’t put it front and centre of our implementation.

    How to respond to a “well-actually”?

    Well-actually… https://www.recurse.com/social-rules#no-well-actuallys 

    There are some editorial quirks in my paper, the one I am most embarrassed by is when my writing conflated the data model with the events in the Library of Congress controlled vocabulary (what other controlled vocabularies have other folks been using in the last decade? Next PREMIS revision, please, put those listings in there or open the editorial process to modern practices). Conflating these two things in one paragraph should hardly be the thread that untangles the entire piece.

    The PREMIS EC haven’t reached out to me before publication, or after, yet as I point out, they all know where to find me. (I wasn’t able to make the PREMIS birds-of-a-feather at iPRES, probably a good thing while this seems to have been in the air, but I was at the conference.) Their response though does something strange, directing their efforts at things I might not have understood, may seemingly be getting at; or pointing out what I am “really saying here”. It is a patronising approach. For the gaps they filled in on my behalf, I would happily have provided clarity. That would have offered me the opportunity to respond in a less reactive way, or perhaps offered all of us a chance to collaborate.

    Their response appeals to authority, and its two references are my article and the PREMIS data dictionary. I am sure there was a more neutral, reflective, and holistic way to approach this work by focusing on the entirety of the article and its spirit, and giving the benefit of the doubt to what is perceived as the author’s “mistakes” or “misreadings”. A show don’t tell approach might have helped, and would certainly be valuable, e.g. spending more time implementing examples that lent themselves to updating future revisions of the data dictionary and conformance statements. 

     ¯\_(ツ)_/¯

    Anyway folks. ¯\_(ツ)_/¯ Interpretation is tricky? I imagine that the PREMIS EC will find fault with the above text, but to try to avoid another article on the subject of my misinterpretation: The PREMIS EC aren’t foisting the standard on you and I most definitely am not. Read their docs if you do choose PREMIS. Technology changes and so do standards. I feel we have an obligation to modernise (and demonstrate modernisation) with those changes. I feel we have an obligation to question, and evaluate as time moves on; especially when technology is front and centre of how we support our archivists and librarians.

    Hopefully people reading this can continue to read the original paper for what it is. There may be some potentially interesting ideas and conclusions that a pure PREMIS discussion distracts from, including what event-sourced data might mean for activating information supporting digital preservation.

    Hopefully too, from this engagement, the PREMIS EC will take an opportunity to fold some of their own response into their own documentation and guidance.

    Thanks for reading.

    PREMIS conformance statement (2015): https://www.loc.gov/standards/premis/premis-conformance-20150429.pdf

    PREMIS data dictionary (Version 3.0 (2015)): https://www.loc.gov/standards/premis/v3/premis-3-0-final.pdf

     

    https://exponentialdecay.co.uk/blog/dont-implement-premis-represent-it/

    #Code #Coding #Data #digipres #DigitalPreservation #PREMIS #WDPD #WorldDigitalPreservationDay

  7. Back by popular demand, the #PRONOM team will be running their yearly hackathon on 7th November-15th November, to celebrate #WDPD

    They will be kicking off the week with a PRONOM Open Drop-In session on the 7th dedicated to answering your questions.

    openpreservation.org/news/pron

  8. @mickylindlar @Thorsted for “Only errors in the files,” it’s gotta be #wtfpdf. This would actually be so fun for #WDPD (World Digital Preservation Day)