home.social

#pronom — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #pronom, aggregated by home.social.

  1. The #PRONOM file format registry is getting a makeover - you can take a look at the new version here: pronom.nationalarchives.gov.uk/

    (But the makeover is not finished yet.)

  2. Now is the time for a shout-out to @exponentialdecay for his file format signature development utility (ffdev.info/) and the SPARQL endpoint to #PRONOM data (btw, I can't find the URL any more...!).

    And also to @richardlehane for Siegfried + Roy, always incredibly useful tools.

    #digipres

  3. @Thorsted @leninoc @britpunk80

    is there a chance that someone would show up at the #PRONOM drop-in session this week??

    I have a few questions on LaTeX files, PDF/A and Virtual Instruments files 😀 !

  4. Hey #PRONOM folks! I was wondering whether some signatures of source code files (e.g., nationalarchives.gov.uk/PRONOM ) could be improved by adding a pattern based on the shebang (en.wikipedia.org/wiki/Shebang_)? Would there be side effects?

  5. A couple of exciting updates in the #fileformat space this week. First, #PRONOM released a new signature, v121, with 29 new PUIDs, 29 new signatures and 18 updates, including an APK signature! @BertrandCaron #digipres nationalarchives.gov.uk/abouta 1/2

  6. Indirect object complements (compléments d’objet indirects): syntactic aspects

    Outline of the article:

    I. General definition
    II. Introductory preposition and syntactic category
    III. Transformation rules
    IV. Conclusions and bibliography

    I. General definition

    What are indirect object complements (COI)?

    Indirect object complements (compléments d’objet indirects, COI) are recognized by the grammatical tradition as essential complements of the verb, on a par with direct object complements (compléments d’objet directs, COD). Compared with the latter, they are characterized by their particular syntax, and notably by the introductory preposition that heads them (1).

    (1) Je parle à Jean.

    Identifying them, however, is more complex, since they superficially resemble other types of prepositional phrases, notably the family of so-called "circumstantial" complements, scene-setting complements, and certain sentence-level complements, which sometimes share some of their properties. In school grammar these problems long proved insurmountable, and textbooks frequently identified circumstantial complements as COIs, and vice versa.

    Historically, there is indeed a relationship between these complements: a number of COIs were, over the history of the French language, circumstantial complements that were gradually integrated into the verb’s valency. Because they very often accompanied a given verb and were consistent with its meaning, these complements ended up developing a fairly strong bond with the verb and becoming one of its actants.

    The COI is therefore defined as an essential complement of the verb, introduced by a preposition and distinct, in its properties, from other types of prepositional phrases.

    The link between the COI and the verb is nevertheless looser than with a COD or a predicative complement (attribut), since a preposition is precisely what is needed to establish the relation with the verb. In this sense, and beyond the syntactic parameters listed below, the semantic parameter is essential for identifying COIs: it is the context, and the relation of meaning between the verb and the COI, that guides the analysis.

    Thus a locative complement such as à l’école is indeed a COI of the verb aller, since the meaning of the verb calls for a complement indicating the endpoint of the movement. After a verb like parler, by contrast, it is rather a circumstantial complement with scene-setting value: the verb’s meaning, or its "drama" to borrow Tesnière’s term, does not imply a locative specification within its actantial schema, where one would instead expect the person spoken to or the topic of the discussion.

    (2a) Je vais à l’école (COI)
    (2b) Je parle à l’école (circumstant) de mathématiques (COI)

    This article will not return to these semantic aspects, which will be developed in depth in a future post on circumstants. There are, however, fairly stable syntactic features worth reviewing here in order to identify COIs.

    II. Introductory preposition and syntactic category

    The preposition introducing the COI remains one of its fundamental traits: it is what distinguishes it, notably, from CODs and predicative complements. The category of the COI, on the other hand, can vary. It may be headed by a nominal (noun or pronoun), an infinitive (the "quasi-nominal" form of the verb), or a subordinate clause, either completive or integrative (also called "indefinite").

    (3a) Je parle de Pierre / de moi (nominal head)
    (3b) Je parle de partir (infinitive head)
    (3c) Je parle de ce que je veux (completive subordinate clause as head)
    (3d) Je parle de qui je veux (indefinite subordinate clause as head)

    The prepositions introducing COIs are likewise varied. Besides the triad à/de/en, made up of the most common prepositions in French, one also finds, again depending on the meaning of the verb, other prepositions with a more transparent meaning such as sur (je m’assois sur une chaise), contre (je m’appuie contre le mur) or pour (je vote pour mon candidat). Two points should be kept in mind about them:

    (i) First, the choice of preposition is constrained by the verb. Some verbs do allow a degree of variation, with different effects of meaning, but this is rare in French.

    (4a) Je parle à/de/pour Jean.
    (4b) *Je vais selon l’école

    (ii) Second, when the preposition is not à, de or en, its meaning must be consistent with the verb. A locative preposition is thus readily accepted with a verb of movement (5a), whereas a preposition expressing purpose or intention is much harder to use (5b).

    (5a) Il parvient jusqu’au sommet.
    (5b) *Il parvient pour le sommet.

    It is precisely because the meaning of the verb and that of the preposition are consistent that, historically, the reanalysis of the circumstant as a COI could take place gradually. Note also that the preposition makes it possible to distinguish different senses of a verb, depending on how the complement is constructed:

    (6a) Je connais Jean.
    (6b) Le juge connaît de l’affaire (= "to be capable of judging the case")

    Sometimes, too, the choice of preposition steers the interpretation, with more or less subtle nuances. Modern usage has recently seen an opposition stabilize between habiter à Paris and habiter sur Paris, where sur indicates a more distant or vaguer location (à Paris = within the city limits; sur Paris = in the vicinity of Paris, in the inner suburbs for example). Usage thus continues to reshape verbal valency, drawing on the complexity of prepositions to produce new effects of meaning.

    III. Transformation rules

    Certain syntactic transformation rules also help guide the analysis and distinguish "true" COIs, that is, the actants of the verb, from other types of prepositional phrases, by testing the syntactic link between the COI and its verb. In particular, COIs can be pronominalized in preverbal position:

    (7a) Je parle de Jean <=> J’en parle.
    (7b) Je parle lentement <≠> *Je le parle.
    (7c) Je parle à voix basse <≠> *J’y parle

    Compared with CODs or predicative complements, however, the pronominalization rules for COIs are somewhat more complex. Three transformation regimes must be distinguished, depending both on the introductory preposition and on the referential status of the COI according to the +/- human parameter:

    (i) A first regime covers COIs introduced by à. Pronominalization uses either y for -human COIs (8a) or lui for +human COIs (8b). In the latter case, the pronoun lui does not mark masculine or feminine gender, whether grammatical or ontological.

    (8a) Je réponds à son courrier <=> J’y réponds.
    (8b) Je réponds à Marie <=> Je lui réponds.

    In some cases, the transformation can instead keep a prepositional phrase introduced by à, followed by lui/elle(s)/eux/ça, alongside pronominalization with y. This choice is occasionally made to resolve an ambiguity of interpretation. Thus (9a) is the transformation of both (9b) and (9c).

    (9a) J’y pense.
    (9b) Je pense à l’avenir (Je pense à ça)
    (9c) Je pense à mes enfants (Je pense à eux)

    Note also that y nevertheless tends to specialize for non-human referents: this is the preferred interpretation, and some regional varieties (around Lyon, for example) extend this property to the COD, in order to distinguish the reference of complements from that of the object pronoun le/la (Je le [Jean] vois vs. J’y [la table] vois).

    (ii) COIs introduced by de are all pronominalized by en. This pronoun is truly tied to the word form de, since it is also used to transform CODs introduced by the partitive or the indefinite determiner de. Care must therefore be taken not to confuse these forms, and to check the status of de as preposition or determiner.

    (10a) Je parle de Jean <=> J’en parle (COI)
    (10b) Je veux de l’eau <=> J’en veux (COD)

    (iii) Finally, the other types of COI are pronominalized as preposition + pronoun for animates:

    (11a) Jean tourne autour de Marie <=> Jean tourne autour d’elle.
    (11b) Je compte sur Jean <=> Je compte sur lui.

    Or, for inanimates, by repeating the preposition alone, without the rest of the phrase.

    (12) J’ai voté contre la loi <=> J’ai voté contre.

    The identification of these last complements as COIs is sometimes disputed, but two arguments support the analysis. First, pronominalization with lui is still allowed for animates (13a), even if some grammars associate this transformation with a colloquial or casual register. Second, fronting the complement to the head of the utterance is felt to be incorrect or awkward (13b): since the COI is a verbal complement, it cannot be moved as freely as a scene-setting complement.

    (13a) Jean lui tourne autour.
    (13b) ?Autour de Marie, Jean tourne.

    This fronting test is in fact crucial. Although fronting is always possible for COIs, it requires a resumptive pronoun in preverbal position to keep the utterance grammatical, which is not the case for scene-setting complements (14).

    (14a) (À) Jean, je lui parle / ?(À Jean), je parle
    (14b) Sur le quai, j’ (*y) attends.

    The complexity of these analyses, and the fact that they rely on our intuitions as speakers, nevertheless rules out absolute certainty for some complements. Likewise, in diachrony the discussion is practically impossible to conduct, since we cannot appeal to this linguistic intuition.
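
    As a rough illustration, the three pronominalization regimes of section III can be sketched as a small decision function. This is only a sketch under simplifying assumptions: the names Complement and pronominal_form are hypothetical, a single boolean stands in for both the +/- human and the animate/inanimate distinctions, and none of the borderline cases discussed above (à lui / à eux, regional uses of y, acceptability judgments) are modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Complement:
    """A COI reduced to the two parameters used in section III (hypothetical type)."""
    preposition: str  # introductory preposition, e.g. "à", "de", "sur", "contre"
    human: bool       # +/- human; also used here as a stand-in for "animate"


def pronominal_form(c: Complement) -> str:
    """Return the pronominal pattern predicted by the three regimes."""
    if c.preposition == "à":
        # Regime (i): y for -human, lui for +human.
        return "lui" if c.human else "y"
    if c.preposition == "de":
        # Regime (ii): en, whatever the referent.
        return "en"
    # Regime (iii): preposition + pronoun for animates, bare preposition otherwise.
    return f"{c.preposition} + pronoun" if c.human else c.preposition


# Inputs modelled on examples (8a), (8b), (10a), (11b) and (12).
for example in [
    Complement("à", human=False),
    Complement("à", human=True),
    Complement("de", human=True),
    Complement("sur", human=True),
    Complement("contre", human=False),
]:
    print(example.preposition, example.human, "->", pronominal_form(example))
```

    The function only encodes the default mapping; as the article stresses, actual identification still depends on the meaning of the verb and on speaker intuition.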

    IV. Conclusions and bibliography

    COIs remind us, if a reminder were needed, that nothing in linguistic analysis is absolutely beyond dispute: grammatical phenomena are not mathematical equations to be solved, and a degree of interpretation will always be necessary, even if tests and tools help guide the discussion. COIs are prime witnesses to this, since they sit on the border between the actants of the verb and the circumstants, and that is before even entering the difficult terrain of historical change or geographical variation.

    Among the references we can give:

    • In 1972 Jacqueline Pinchon wrote a study, Les pronoms adverbiaux en et y, which has unfortunately not been reprinted. Consulting it will nonetheless shed light on this thorny question.
    • In addition to the references given in the article on prepositions, which are also relevant to this discussion, the article by Le Querier (1999) on Beckett’s Fin de partie is worth reading closely for a stylistic/semantic perspective on the question.

    Site licensed under Creative Commons (CC BY-NC-ND 4.0): sharing permitted, provided the original source is cited and attributed. Modification and commercial use strictly prohibited (link)

    #complément #complémentCirconstanciel #complémentDObjetDirect #complémentDObjetIndirect #grammaire #MathieuGoux #pronom #Syntaxe #valenceVerbale #verbe

  7. Les compléments d’objet indirects : aspects syntaxiques

    Plan de l’article :

    I. Définition générale
    II. Préposition inaugurale et nature syntaxique
    III. Règles de transformation

    IV. Conclusions et bibliographie

    I. Définition générale

    Que sont les compléments d’objet indirects (COI) ?

    Les compléments d’objet indirects (COI) sont reconnus par la tradition grammaticale comme des compléments essentiels du verbe, à l’aune des compléments d’objet directs (COD). Ils se caractérisent, au regard de ces derniers, par leur syntaxe particulièrement et notamment par la préposition inaugurale qui les introduit (1).

    (1) Je parle à Jean.

    Leur repérage, cependant, est plus complexe dans la mesure où ils ressemblent, superficiellement, à d’autres types de groupes prépositionnels, notamment la famille des compléments dits « circonstanciels », des compléments à valeur scénique ou de certains compléments de phrase, qui partagent d’ailleurs parfois certaines de leurs propriétés. Ces problèmes ont été, en grammaire scolaire, longtemps indépassables : et il était fréquent que les manuels identifient comme des COI des compléments circonstanciels, et réciproquement.

    Historiquement, il y a effectivement une relation entre ces compléments : un certain nombre de COI ont été, dans l’histoire de la langue française, des compléments circonstanciels qui ont été progressivement intégrés dans la valence verbale. En effet, un certain nombre de ces compléments, parce qu’ils accompagnaient très souvent un verbe et étaient cohérents avec son sémantisme, ont fini par développer une relation de solidarité assez forte avec le verbe et devenir un de ses actants.

    Le COI se définit donc comme un complément essentiel du verbe, introduit par une préposition et distinct, par ses propriétés, des autres types de groupes prépositionnels.

    Le lien, cependant, entre le COI et le verbe est plus lâche qu’avec un COD ou un attribut, dans la mesure où l’on a précisément besoin d’une préposition pour assurer la relation avec le verbe. En ce sens, et au-delà des paramètres syntaxiques que l’on énumèrera ci-après, le paramètre sémantique est essentiel pour identifier les COI. C’est en effet le contexte, et la relation de sens entre le verbe et le COI, qui orientera l’analyse.

    Ainsi, un complément locatif du type à l’école sera bien un COI du verbe aller, dans la mesure où le sens du verbe suppose un complément indiquant le point d’arrivée du mouvement ; en revanche, il sera davantage un complément circonstanciel, à valeur scénique, derrière un verbe comme parler puisque son sémantisme, ou son « drame » pour reprendre la formule de Tesnières, n’implique pas une précision locative au regard du schéma actanciel du verbe où l’on attendrait davantage la personne à qui l’on parle, ou le sujet de la discussion.

    (2a) Je vais à l’école (COI)
    (2b) Je parle à l’école (circonstant) de mathématiques (COI)

    Dans cet article, nous ne reviendrons pas sur ces aspects sémantiques, qui feront l’objet d’un développement approfondi dans un futur billet sur les circonstants. Il y a, en revanche, des éléments syntaxiques assez stables sur lesquels il est bon de revenir ici pour identifier les COI.

    II. Préposition inaugurale et nature syntaxique

    La préposition introduisant le COI demeure l’un de ses traits fondamentaux : c’est ce qui le distingue notamment des COD et des attributs. En revanche, la nature du COI peut être diverse. On peut trouver là des noyaux nominaux (substantifs ou pronoms), des infinitifs (forme « quasi-nominale » du verbe) ou des subordonnées, complétives ou intégratives (dites encore « indéfinies »).

    (3a) Je parle de Pierre / de moi (noyau nominal)
    (3b) Je parle de partir (noyau infinitif)
    (3c) Je parle de ce que je veux (noyau subordonnée complétive)
    (3d) Je parle de qui je veux (noyau subordonnée indéfinie)

    Les prépositions introduisant des COI sont également multiples. Outre la triade à/de/en, composée des prépositions les plus usuelles du français, nous pouvons également trouver, toujours selon le sémantisme du verbe, d’autres prépositions au sens plus transparent comme sur (je m’assois sur une chaise), contre (je m’appuie contre le mur) ou pour (je vote pour mon candidat). On retiendra cependant deux éléments les concernant :

    (i) D’une part, le choix de la préposition est contraint par le verbe. Si certains d’entre eux autorisent, avec différents effets de sens, une certaine variation, la chose est rare en français.

    (4a) Je parle à/de/pour Jean.
    (4b) *Je vais selon l’école

    (ii) D’autre part, il faut que le sens de la préposition, dans le cas où celle-ci n’est pas à, de ou en, soit cohérent avec le verbe. Ainsi, on acceptera volontiers une préposition locative avec un verbe de mouvement (5a), mais il sera plus difficile d’employer une préposition liée au but ou à l’intention (5b).

    (5a) Il parvient jusqu’au sommet.
    (5b) *Il parvient pour le sommet.

    C’est précisément parce qu’il y a cohérence entre le sens du verbe et la préposition qu’historiquement, la réanalyse du circonstant en COI a pu se faire progressivement. On notera d’ailleurs que la préposition permet de distinguer divers sens à un verbe, en fonction du mode de construction du complément :

    (6a) Je connais Jean.
    (6b) Le juge connaît de l’affaire (= « être capable de juger l’affaire »)

    Parfois encore, le choix de la préposition oriente l’interprétation, avec des nuances plus ou moins fines. On a vu récemment, dans la langue moderne, se stabiliser une opposition entre habiter à Paris et habiter sur Paris, la préposition sur indiquant une localisation plus lointaine ou plus vague (à Paris = intra-muros ; sur Paris = dans le voisinage de Paris, en banlieue proche par exemple). Aussi, l’usage continue de modifier la valence verbale en s’appuyant sur la complexité des prépositions, pour déterminer des effets de sens nouveaux.

    III. Règles de transformation

    Certaines règles de transformation syntaxique permettent également d’orienter l’analyse, et de distinguer les « vrais » COI, c’est-à-dire les actants du verbe, d’autres types de groupes prépositionnels, en jouant sur le lien syntaxique que le COI entretient avec son verbe. Notamment les COI peut être pronominalisés en position préverbale :

    (7a) Je parle de Jean <=> J’en parle.
    (7b) Je parle lentement <≠> *Je le parle.
    (7c) Je parle à voix basse <≠> *J’y parle

    Au regard des COD ou des attributs en revanche, les règles de pronominalisation de COI sont un peu plus complexes. On doit notamment distinguer trois régimes de transformation, en fonction et de la nature de la préposition inaugurale, et du statut référentiel du COI selon le paramètre +/- humain. On distinguera alors :

    (i) Un premier régime avec les COI introduits par à. La pronominalisation s’effectue alors soit par y pour les COI -humain (8a), soit par lui pour les COI +humain (8b). Dans ce dernier cas, le pronom lui ne marque pas le genre masculin ou féminin, que ce soit au niveau grammatical ou ontologique.

    (8a) Je réponds à son courrier <=> J’y réponds.
    (8b) Je réponds à Marie <=> Je lui réponds.

    Dans certains cas, la transformation peut s’effectuer en conservant un GP introduit par à, suivi de lui/elle(s)/eux/ça, en parallèle de la pronominalisation en y. C’est un choix fait pour lever, occasionnellement, une ambiguïté interprétative. Ainsi, (9a) est tant la transformation de (9b) que de (9c).

    (9a) J’y pense.
    (9b) Je pense à l’avenir (Je pense à ça)
    (9c) Je pense à mes enfants (Je pense à eux)

    On notera également que y tend néanmoins à se spécialiser dans le non-humain : c’est l’interprétation préférentielle, et certaines variétés diatopiques (dans le lyonnais par exemple) étend cette propriété au COD, pour distinguer la référence des compléments au regard du pronom objet le/la (Je le [Jean] vois vs. J’y [la table] vois).

    (ii) Les COI introduits par de se pronominalisent tous par en. Ce pronom est véritablement lié au mot-forme de, puisqu’on le retrouve également pour la transformation des COD introduits par le partitif ou le déterminant indéfini de. Il faut donc veiller à ne pas confondre les formes entre elles, et de vérifier le statut de de, préposition ou déterminant.

    (10a) Je parle de Jean <=> J’en parle (COI)
    (10b) Je veux de l’eau <=> J’en veux (COD)

    (iii) Enfin, les autres types de COI se pronominalisent sous la forme préposition + pronom pour les animés :

    (11a) Jean tourne autour de Marie <=> Jean tourne autour d’elle.
    (11b) Je compte sur Jean <=> Je compte sur lui.

    Ou, pour les inanimés, par un rappel de la préposition « seule », sans le reste du syntagme.

    (12) J’ai voté contre la loi <=> J’ai voté contre.

    L’identification de ces derniers compléments comme COI est parfois discutée, mais deux arguments peuvent être avancés pour conduire l’analyse : d’une part, la pronominalisation avec lui est encore autorisée pour les animés (13a), même si certaines grammaires associent la transformation à un niveau de langue populaire ou relâchée. D’autre part, le détachement en tête d’énoncé est senti comme incorrect ou maladroit (13b). Or, le COI étant un complément verbal, on ne peut le déplacer librement comme on peut le faire avec un complément à valeur scénique.

    (13a) Jean lui tourne autour.
    (13b) ?Autour de Marie, Jean tourne.

    Ce test de déplacement en tête d’énoncé est d’ailleurs crucial. Si on peut toujours le faire pour les COI, on notera qu’il demande un rappel par cataphore d’un pronom en position préverbale pour assurer la grammaticalité de l’énoncé, ce qui n’y pas le cas des compléments à valeur scénique (14).

    (14a) (À) Jean, je lui parle / ?(À Jean), je parle
    (14b) Sur le quai, j’ (*y) attends.

    La complexité de ces analyses, et le fait qu’elles fassent appel à notre sentiment de langue, empêche cependant d’avoir des certitudes absolues pour certains compléments. En diachronie de même, il est pour ainsi dire impossible de mener la discussion, comme nous ne pouvons pas faire appel à ce sentiment linguistique.

    IV. Conclusions et bibliographie

    Les COI nous rappellent, si besoin était, que rien dans l’analyse de langue n’est absolument indiscutable : les phénomènes grammaticaux ne sont pas des équations mathématiques à résoudre, et une part d’interprétation sera toujours nécessaire dans l’analyse même si des tests et des outils nous permettent d’orienter les discussions. Les COI sont des témoins privilégiés de cette observation, comme ils se situent à la frontière entre les actants du verbe et les circonstants, sans même rentrer dans le terrain, difficile, de l’évolution historique ou de la variation géographique.

    Parmi les références que nous pouvons donner :

    • Jacqueline Pinchon a écrit, en 1972, une étude sur Les pronoms adverbiaux en et y, hélas non réédité. Sa consultation permettra cependant d’y voir plus clair sur cette question épineuse.
    • Outre les références données dans l’article sur les prépositions, qui serviront également pour la discussion, on lira avec attention l‘article de Le Querier (1999), sur Fin de partie de Beckett, pour un point de vue stylistique/sémantique sur la question.

    Site sous licence Creative Commons (CC BY-NC-ND 4.0) : partage autorisé, sous couvert de citation et d’attribution de la source originale. Modification et utilisation commerciale formellement interdites (lien)

    #complément #complémentCirconstanciel #complémentDObjetDirect #complémentDObjetIndirect #grammaire #MathieuGoux #pronom #Syntaxe #valenceVerbale #verbe

  8. Les compléments d’objet indirects : aspects syntaxiques

    Plan de l’article :

    I. Définition générale
    II. Préposition inaugurale et nature syntaxique
    III. Règles de transformation

    IV. Conclusions et bibliographie

    I. Définition générale

    Que sont les compléments d’objet indirects (COI) ?

    Les compléments d’objet indirects (COI) sont reconnus par la tradition grammaticale comme des compléments essentiels du verbe, à l’aune des compléments d’objet directs (COD). Ils se caractérisent, au regard de ces derniers, par leur syntaxe particulièrement et notamment par la préposition inaugurale qui les introduit (1).

    (1) Je parle à Jean.

    Leur repérage, cependant, est plus complexe dans la mesure où ils ressemblent, superficiellement, à d’autres types de groupes prépositionnels, notamment la famille des compléments dits « circonstanciels », des compléments à valeur scénique ou de certains compléments de phrase, qui partagent d’ailleurs parfois certaines de leurs propriétés. Ces problèmes ont été, en grammaire scolaire, longtemps indépassables : et il était fréquent que les manuels identifient comme des COI des compléments circonstanciels, et réciproquement.

    Historiquement, il y a effectivement une relation entre ces compléments : un certain nombre de COI ont été, dans l’histoire de la langue française, des compléments circonstanciels qui ont été progressivement intégrés dans la valence verbale. En effet, un certain nombre de ces compléments, parce qu’ils accompagnaient très souvent un verbe et étaient cohérents avec son sémantisme, ont fini par développer une relation de solidarité assez forte avec le verbe et devenir un de ses actants.

    Le COI se définit donc comme un complément essentiel du verbe, introduit par une préposition et distinct, par ses propriétés, des autres types de groupes prépositionnels.

    Le lien, cependant, entre le COI et le verbe est plus lâche qu’avec un COD ou un attribut, dans la mesure où l’on a précisément besoin d’une préposition pour assurer la relation avec le verbe. En ce sens, et au-delà des paramètres syntaxiques que l’on énumèrera ci-après, le paramètre sémantique est essentiel pour identifier les COI. C’est en effet le contexte, et la relation de sens entre le verbe et le COI, qui orientera l’analyse.

    Ainsi, un complément locatif du type à l’école sera bien un COI du verbe aller, dans la mesure où le sens du verbe suppose un complément indiquant le point d’arrivée du mouvement ; en revanche, il sera davantage un complément circonstanciel, à valeur scénique, derrière un verbe comme parler puisque son sémantisme, ou son « drame » pour reprendre la formule de Tesnières, n’implique pas une précision locative au regard du schéma actanciel du verbe où l’on attendrait davantage la personne à qui l’on parle, ou le sujet de la discussion.

    (2a) Je vais à l’école (COI)
    (2b) Je parle à l’école (circonstant) de mathématiques (COI)

    Dans cet article, nous ne reviendrons pas sur ces aspects sémantiques, qui feront l’objet d’un développement approfondi dans un futur billet sur les circonstants. Il y a, en revanche, des éléments syntaxiques assez stables sur lesquels il est bon de revenir ici pour identifier les COI.

    II. Préposition inaugurale et nature syntaxique

    La préposition introduisant le COI demeure l’un de ses traits fondamentaux : c’est ce qui le distingue notamment des COD et des attributs. En revanche, la nature du COI peut être diverse. On peut trouver là des noyaux nominaux (substantifs ou pronoms), des infinitifs (forme « quasi-nominale » du verbe) ou des subordonnées, complétives ou intégratives (dites encore « indéfinies »).

    (3a) Je parle de Pierre / de moi (noyau nominal)
    (3b) Je parle de partir (noyau infinitif)
    (3c) Je parle de ce que je veux (noyau subordonnée complétive)
    (3d) Je parle de qui je veux (noyau subordonnée indéfinie)

    Les prépositions introduisant des COI sont également multiples. Outre la triade à/de/en, composée des prépositions les plus usuelles du français, nous pouvons également trouver, toujours selon le sémantisme du verbe, d’autres prépositions au sens plus transparent comme sur (je m’assois sur une chaise), contre (je m’appuie contre le mur) ou pour (je vote pour mon candidat). On retiendra cependant deux éléments les concernant :

    (i) D’une part, le choix de la préposition est contraint par le verbe. Si certains d’entre eux autorisent, avec différents effets de sens, une certaine variation, la chose est rare en français.

    (4a) Je parle à/de/pour Jean.
    (4b) *Je vais selon l’école

    (ii) D’autre part, il faut que le sens de la préposition, dans le cas où celle-ci n’est pas à, de ou en, soit cohérent avec le verbe. Ainsi, on acceptera volontiers une préposition locative avec un verbe de mouvement (5a), mais il sera plus difficile d’employer une préposition liée au but ou à l’intention (5b).

    (5a) Il parvient jusqu’au sommet.
    (5b) *Il parvient pour le sommet.

    C’est précisément parce qu’il y a cohérence entre le sens du verbe et la préposition qu’historiquement, la réanalyse du circonstant en COI a pu se faire progressivement. On notera d’ailleurs que la préposition permet de distinguer divers sens à un verbe, en fonction du mode de construction du complément :

    (6a) Je connais Jean.
    (6b) Le juge connaît de l’affaire (= « être capable de juger l’affaire »)

    Parfois encore, le choix de la préposition oriente l’interprétation, avec des nuances plus ou moins fines. On a vu récemment, dans la langue moderne, se stabiliser une opposition entre habiter à Paris et habiter sur Paris, la préposition sur indiquant une localisation plus lointaine ou plus vague (à Paris = intra-muros ; sur Paris = dans le voisinage de Paris, en banlieue proche par exemple). Aussi, l’usage continue de modifier la valence verbale en s’appuyant sur la complexité des prépositions, pour déterminer des effets de sens nouveaux.

    III. Règles de transformation

    Certaines règles de transformation syntaxique permettent également d’orienter l’analyse, et de distinguer les « vrais » COI, c’est-à-dire les actants du verbe, d’autres types de groupes prépositionnels, en jouant sur le lien syntaxique que le COI entretient avec son verbe. Notamment les COI peut être pronominalisés en position préverbale :

    (7a) Je parle de Jean <=> J’en parle.
    (7b) Je parle lentement <≠> *Je le parle.
    (7c) Je parle à voix basse <≠> *J’y parle

    Au regard des COD ou des attributs en revanche, les règles de pronominalisation de COI sont un peu plus complexes. On doit notamment distinguer trois régimes de transformation, en fonction et de la nature de la préposition inaugurale, et du statut référentiel du COI selon le paramètre +/- humain. On distinguera alors :

    (i) Un premier régime avec les COI introduits par à. La pronominalisation s’effectue alors soit par y pour les COI -humain (8a), soit par lui pour les COI +humain (8b). Dans ce dernier cas, le pronom lui ne marque pas le genre masculin ou féminin, que ce soit au niveau grammatical ou ontologique.

    (8a) Je réponds à son courrier <=> J’y réponds.
    (8b) Je réponds à Marie <=> Je lui réponds.

    In some cases, the transformation can instead keep a prepositional phrase (GP) introduced by à, followed by lui/elle(s)/eux/ça, alongside pronominalisation with y. This choice is occasionally made to resolve an interpretive ambiguity. Thus (9a) is the transformation of both (9b) and (9c).

    (9a) J’y pense.
    (9b) Je pense à l’avenir (Je pense à ça)
    (9c) Je pense à mes enfants (Je pense à eux)

    Note, however, that y tends to specialise in non-human reference: this is the preferred interpretation, and some regional varieties (around Lyon, for example) extend this property to the COD, to distinguish the reference of complements from that of the object pronoun le/la (Je le [Jean] vois vs. J’y [la table] vois).

    (ii) COI introduced by de are all pronominalised with en. This pronoun is truly tied to the word form de, since it also appears in the transformation of COD introduced by the partitive or the indefinite determiner de. Care must therefore be taken not to confuse these forms, and to check the status of de: preposition or determiner.

    (10a) Je parle de Jean <=> J’en parle (COI)
    (10b) Je veux de l’eau <=> J’en veux (COD)

    (iii) Finally, the other types of COI are pronominalised as preposition + pronoun for animates:

    (11a) Jean tourne autour de Marie <=> Jean tourne autour d’elle.
    (11b) Je compte sur Jean <=> Je compte sur lui.

    Or, for inanimates, by repeating the preposition “alone”, without the rest of the phrase.

    (12) J’ai voté contre la loi <=> J’ai voté contre.
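    The three regimes can be summarised as a small decision table. The sketch below is only an illustration of the rules as stated here: the function name and its parameters are hypothetical, the single `animate` flag stands in for both the +/- human and the animate/inanimate distinctions, and the alternative patterns discussed around (9) and (13a) are deliberately left out.

```python
def coi_pronoun_pattern(preposition: str, animate: bool) -> str:
    """Return the schematic pronoun pattern for a COI, following regimes (i)-(iii).

    `preposition` is the bare introducing preposition ("à", "de", "sur", ...).
    `animate` is a simplification covering the article's +human / animate cases.
    """
    if preposition == "à":
        # Regime (i): lui for +human referents, y for -human referents.
        return "lui" if animate else "y"
    if preposition == "de":
        # Regime (ii): en, whatever the referent.
        return "en"
    # Regime (iii): preposition + pronoun for animates, bare preposition otherwise.
    return f"{preposition} + pronoun" if animate else preposition


print(coi_pronoun_pattern("à", animate=False))      # y, as in (8a)
print(coi_pronoun_pattern("à", animate=True))       # lui, as in (8b)
print(coi_pronoun_pattern("de", animate=True))      # en, as in (10a)
print(coi_pronoun_pattern("sur", animate=True))     # sur + pronoun, as in (11b)
print(coi_pronoun_pattern("contre", animate=False)) # contre, as in (12)
```

    A table of this kind only captures the default pattern for each regime; the acceptability judgements themselves still depend on the speaker’s intuition.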

    The identification of these last complements as COI is sometimes disputed, but two arguments can be put forward in its favour. On the one hand, pronominalisation with lui is still permitted for animates (13a), even if some grammars associate this transformation with a colloquial or informal register. On the other hand, fronting to the start of the utterance is felt to be incorrect or awkward (13b). Since the COI is a verbal complement, it cannot be moved as freely as a scene-setting complement.

    (13a) Jean lui tourne autour.
    (13b) ?Autour de Marie, Jean tourne.

    This fronting test is in fact crucial. Although it can always be performed on COI, the fronted COI must be resumed by a pronoun in preverbal position to keep the utterance grammatical, which is not the case for scene-setting complements (14).

    (14a) (À) Jean, je lui parle / ?(À Jean), je parle
    (14b) Sur le quai, j’ (*y) attends.

    The complexity of these analyses, and the fact that they rely on our linguistic intuition, nevertheless rule out absolute certainty for some complements. Likewise, in diachrony the discussion is all but impossible to conduct, since we cannot appeal to that intuition.

    IV. Conclusions and bibliography

    COI remind us, if any reminder were needed, that nothing in linguistic analysis is absolutely beyond dispute: grammatical phenomena are not mathematical equations to be solved, and some interpretation will always be necessary, even if tests and tools help guide the discussion. COI are prime witnesses to this, since they sit on the border between the verb’s actants and adjuncts, without even venturing onto the difficult terrain of historical change or geographical variation.

    Among the references we can give:

    • Jacqueline Pinchon wrote, in 1972, a study of Les pronoms adverbiaux en et y, sadly never reprinted. Consulting it will nonetheless shed light on this thorny question.
    • Besides the references given in the article on prepositions, which are also useful for this discussion, the article by Le Querier (1999) on Beckett’s Fin de partie is worth a careful read for a stylistic/semantic point of view on the question.

    Site licensed under Creative Commons (CC BY-NC-ND 4.0): sharing permitted, provided the original source is cited and attributed. Modification and commercial use strictly prohibited (link)

    #complément #complémentCirconstanciel #complémentDObjetDirect #complémentDObjetIndirect #grammaire #MathieuGoux #pronom #Syntaxe #valenceVerbale #verbe

  11. File formats as Emoji: 0xffae

    by @beet_keeper

    tldr: emoji.exponentialdecay.co.uk

    File Formats As Emoji (0xFFAE or 0xffae) might be my most random file format hack yet. Indeed, it is a random page generator! But it generates random pages of file formats represented as Emoji.

    The idea came in 2016 with radare releasing a new version that supported an emoji hexdump. I wondered whether I could do something fun combining file

    #0xffae #Code #Coding #digipres #digitalLiteracy #DigitalPreservation #emoji #FileFormat #FileFormatIdentification #FileFormats #learning #PRONOM #pyscript #Python #SkeletonTestCorpus #teaching

  12. Hi @exponentialdecay! @Thorsted and I are working on Android package signatures. I notice that the file development utility does not support the priority mechanism, which would help manage close matches like APK and AAR. Is it something you or the #PRONOM team would develop?

  13. Let’s sing about language with Jérémie Kisling

    (There is more to life than «La langue de chez nous». Songs about language are not in short supply. A small anthology in progress. Playlist available on Spotify. Suggestions welcome.)

    Jérémie Kisling, «Rendez-vous courtois», le Ours, 2005

    One, two, three Allez viens vous asseoir Il faut pas que vous te barres…

    #langue #chanson #pronom #tutoiement #vouvoiement

    oreilletendue.com/2025/01/14/c

  14. Let’s sing about language with Brigitte

    (There is more to life than «La langue de chez nous». Songs about language are not in short supply. A small anthology in progress. Playlist available on Spotify. Suggestions welcome.)

    Brigitte, «Monsieur je t’aime», Et vous, tu m’aimes, 2011

    Monsieur je t’aime Monsieur je t’aime Rendez-vous au cinéma Impatiente, infidèle Je ne vous résiste pas…

    #chanson #langue #pronom #tutoiement #vouvoiement

    oreilletendue.com/2025/01/13/c

  15. A year in file formats 2024

    by @beet_keeper

    A great write-up from Francesca at TNA about the past year for PRONOM via Georgia and the OPF: digipres.club/@Georgia/1136335.

    It’s great to see the continuing work, including the vital translation of guides into other languages. Francesca includes a couple of shout-outs to pieces I have contributed in my spare time this year, including a collaborative workshop with Francesca, David, and Tyler at iPRES2024.

    #Archives #Conferences #digipres #DigitalPreservation #DROID #FileFormat #FileFormats #ipres2024 #outreach #PRONOM

  16. PRONOM’s dustiest records

    NB: because of the complexity of this post, it may be easier to read in its original blog form than here on Mastodon: https://exponentialdecay.co.uk/blog/pronoms-dustiest-records/

    Tyler’s recent blog post for the PRONOM Hack-a-thon Week 2024 (my previous for this week), brought up an interesting point about two of PRONOM’s oldest outline records, Real Video Clip (fmt/204) and Real Video (x-fmt/277). How did they end up in PRONOM?

    Tyler suggests:

    I assume PRONOM originally added these based on MIME types available.

    I thought I knew the answer, and so it prompted a forensic look at the records to see if what I thought I knew aligned with reality!

    As a PRONOM maintainer at The National Archives, UK, from 2009 to 2012, I knew a little of the history of the system. We see some of that history affect us today: for example, of the records that don’t have descriptions or file format signatures, 156 are so-called x-PUIDs. These were a mechanism in PRONOM for working on file formats internally without polluting the public record, and were never meant to make it into the wild. There are 455 x-PUIDs in total. They made it into the wild anyway (before my time), and so they exist as a symbol of PRONOM’s dustiest, oldest records.

    Even by the time I started, PRONOM still had a lot of what we came to call outline records. One of the more positive changes we made to the process back in the day was to stop creating outline records; instead, we would focus on records that could be tied to signatures. This didn’t necessarily make the records more correctly aligned with reality, but it meant records had utility and file formats identified by DROID could be tied back to something that PRONOM “knew about”. I believe the process is a bit more flexible these days, allowing individuals to contribute information to records that ties them back to information like MIME types and specifications. It’s clearer the format is “real” even if a signature is yet to be developed (and of course there are a large number of data formats that are hard to even represent in traditional PRONOM signatures any more, and so they need a record even if there isn’t a neat concept of a signature for them).

    Okay, old man, but what about Tyler’s thesis?

    Stellent and PRONOM

    I learned sometime in my tenure at The National Archives that PRONOM had been seeded with a lot of the formats listed in a technology called OutsideIn previously owned by Stellent and now owned by Oracle.

    Oracle OutsideIn: https://docs.oracle.com/outsidein/853/oit/
    OutsideIn (2010): https://web.archive.org/web/20101016164937/http://www.oracle.com/technetwork/middleware/content-management/oit-all-085236.html
    Data sheet – Formats (2011): https://web.archive.org/web/20110125024733/http://www.oracle.com/technetwork/middleware/content-management/ds-oitfiles-133032.pdf
    COPTR entry: https://coptr.digipres.org/index.php/Oracle_Outside_In_Technology

    I had always had a feeling that the scope of this list was largely exaggerated by the company selling the software, as it is a marketing tool; and if not exaggerated, then perhaps just not as clearly delineated by format as PRONOM is, but rather by software, regardless of the properties of a given “format”, e.g. WinZip and PKZip.

    Back to the story though: I was also reasonably sure I would find Tyler’s RealVideo formats in the format listing, but I did not!

    I downloaded a CSV summarizing the PRONOM records from api.pronom.ffdev.info with:

    curl -X 'GET' \
     'https://api.pronom.ffdev.info/pronom_summary_csv' \
     -H 'accept: application/csv'

    I filtered on outline entries and those without signatures only. I went through the entries still remaining and looked for name matches. I did find some name-for-name matches and some that were close, but no RealVideo or RealVideo Clip.
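    The filtering step above can be sketched in a few lines of Python. This is only an illustration: the column names (puid, name, outline, has_signature) and the third sample row are hypothetical, and the real export from pronom_summary_csv may label its columns differently, so check the header row first.

```python
import csv
import io

# Hypothetical sample data. The real column names in the pronom_summary_csv
# export may differ; inspect the header row and adjust the keys below.
SAMPLE_CSV = """puid,name,outline,has_signature
fmt/204,RealVideo Clip,true,false
x-fmt/22,7-bit ASCII Text,true,false
fmt/0000,Hypothetical Format With Signature,false,true
"""


def outline_without_signature(csv_text):
    """Return (puid, name) pairs for outline records that have no signature."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["puid"], row["name"])
        for row in reader
        if row["outline"].strip().lower() == "true"
        and row["has_signature"].strip().lower() == "false"
    ]


candidates = outline_without_signature(SAMPLE_CSV)
for puid, name in candidates:
    print(puid, name)
```

    With the real file, the CSV text would come from the curl download rather than an inline string.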

    The matches:

    7-bit ANSI Text | yes
    7-bit ASCII Text | yes
    8-bit ANSI Text | yes
    8-bit ASCII Text | yes
    EBCDIC-US | yes
    Framework Database III | yes
    IBM DisplayWrite Document 2 | yes
    IBM DisplayWrite Document 3 | yes
    Micrografx Designer 3.1 | yes
    Nota Bene Text File | yes
    Unicode Text File | yes

    The maybes:

    Cascading Style Sheet | maybe
    Freelance File 1.0-2.1 | maybe
    MacPaint Graphics | maybe
    Microsoft Office Binder File for Windows 95 | maybe
    Microsoft Works Database | maybe
    Microsoft Works Database for DOS 2.0 | maybe
    Microsoft Works Database for Windows 3.0 | maybe
    Microsoft Works Database for Windows 4.0 | maybe
    Professional Write Text File | maybe
    WordPerfect for Windows Document 5.2 | maybe
    XYWrite Document | maybe
    XYWrite Document III | maybe
    XYWrite Document III+ | maybe

    11 exact matches! It’s hardly a headline!

    I had hoped that if I found more exact matches it would provide some clues to where some of the older PRONOM entries came from. I expected most of the outline records to come from this list; alas, there aren’t nearly as many as I anticipated.

    I hoped too that going through the list I might get more clues as to formats that could potentially be deprecated in PRONOM.

    As it stands, from the OutsideIn list, the only records I would personally recommend for deprecation are:

    7-bit ANSI Text
    7-bit ASCII Text
    8-bit ANSI Text
    8-bit ASCII Text
    EBCDIC-US
    Unicode Text File

    We know enough now to be almost certain that if something that looks like these files arrives in the archive it will present as a standard text file, and that we will need to rely on determining the character encoding using tools such as Richard Lehane’s characterize (see characterize’s README for more background). It is unlikely we will be able to attach a signature to these records, and we know there are a great deal more encodings in the world than need be represented as PRONOM identifiers.
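    To illustrate why a static signature is a poor fit here, the following Python sketch guesses an encoding purely from byte content. It is a deliberately simplified heuristic written for this explanation and is not how characterize works; the function name and the category labels are assumptions, and it does not distinguish UTF-32 or the many legacy 8-bit code pages.

```python
def guess_text_encoding(data: bytes) -> str:
    """Very rough encoding guess based on byte content alone (illustrative only)."""
    # A byte order mark is the closest thing plain text has to a signature,
    # and most plain-text files do not carry one.
    if data.startswith(b"\xef\xbb\xbf"):
        return "UTF-8 (BOM)"
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "UTF-16 (BOM)"
    # No byte with the high bit set: indistinguishable from 7-bit ASCII.
    if all(byte < 0x80 for byte in data):
        return "7-bit ASCII"
    # High-bit bytes that form valid UTF-8 sequences.
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        # Could be ANSI, EBCDIC, or any other 8-bit encoding; bytes alone cannot say.
        return "unknown 8-bit encoding"


print(guess_text_encoding(b"plain text"))
print(guess_text_encoding("h\u00e9llo".encode("latin-1")))
```

    The last branch is the important one: once a file falls through to it, nothing in the bytes reliably separates one 8-bit encoding from another, which is why the answer has to come from content analysis rather than a PRONOM signature.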

    NB. This might be something to formalize in a PRONOM decision-making rubric, connected also to formalizing approaches for XML-based signatures.

    A bit of a let down, or is it?

    Still uncomfortable with so many outline records and little provenance for them, I wanted to find more information about the source of PRONOM data and so I decided to take a different path — I surfed the internet for answers!

    Out of the list of outline records I found a few to be overly specific, or slightly weird, i.e. not really things we hear much about day-to-day. Some examples:

    ACBM Graphics
    Apple Sound
    AutoCAD Plot Configuration File 1.0-R13
    AutoCAD Plot Configuration File R14
    AutoSketch Drawing
    Btrieve Database 5.1
    CorelDraw Pattern
    DEC Data Exchange File
    DEC WPS Plus Document
    Dr Halo Bitmap
    Generic Library File
    HTML Extension File
    Hewlett Packard AdvanceWrite Text File
    Inkwriter/Notetaker Template
    Inset Systems Bitmap
    Instalit Script
    Interleaf Document
    Microsoft Excel Add-In
    Microsoft Excel ODBC Query
    Microsoft Excel Toolbar
    Microsoft Powerpoint Design Template
    Microsoft Print File
    Microstation CAD Drawing 95
    NAP Metafile
    Nota Bene Text File
    OS/2 Change Control File
    Revit External Group
    SAP Document
    SAS Data File
    Scanstudio 16-Colour Bitmap
    Schedule+ Contacts
    Speller Custom Dictionary
    Unisys (Sperry) System Data File
    Wordperfect Secondary File 5.0
    Wordperfect Secondary File 5.1/5.2
    form*Z Project File

    ACBM graphics? Dr Halo Bitmap? Btrieve database, “5.1”? Where are the other five?!!

    It gave me pause. I didn’t believe these were all formats well known to the folks who created PRONOM, and I know we didn’t have a digital transfer program at the time so advanced that agencies were submitting a huge variety of formats to PRONOM for future preservation.

    I felt they had to come from somewhere, but where?

    Enter Filext.com

    Because these formats were very specific I found listings on the internet that I knew had to be part of the story. I had immediate luck just looking for combinations of these names, e.g. ACBM Graphics + NAP Metafile.

    In particular I found listings on different websites from hobbyists or universities that all looked the same or similar, e.g.

    There were definite matches with PRONOM which we will get to, but I started to wonder about the provenance of these extensions.

    I kept looking and I found one clue, a header and footer of a file that looked like those above and read as follows:

    Copyright © 2002 Computer Knowledge
    All Rights Reserved
    This download for personal use only. Do NOT distribute
    it to others either alone or incorporated into any
    software without prior permission from Computer Knowledge.
    Developers who wish to incorporate portions of the list
    please see the comments at the end of this file.

    Developer permissions...

    This total file may not be included in any other software or
    project which presents the data to the public or portions of
    the public. Any developer who wishes to include up to (but
    not more than) 2,000 individual entries from this file is free
    to do so provided certain conditions are met. These are:

      1) Credit must be given to FILExt. If links are available
      in the developed product then one must also be provided to
      FILExt as http://filext.com.

      Suggested text: "File extension list courtesy of FILExt.
      For a more extensive list visit http://filext.com."

      2) Once the extensions are chosen for one product by any
      developer then these same extensions must continue to be
      used by that developer for any other projects (i.e., you
      cannot take one set of 2,000 for one project and a different
      set of 2,000 for another project; it's a total of 2,000).

      3) If links are available in the developed product then any
      links appearing associated with any of the 2,000 picked
      extensions must be included in the product. (This covers
      future plans to include such links in this list.)

    When the project is complete please notify FILExt with the
    specifics at [email protected]. We're always interested
    in how the list is being used. Thank you.

    Filext.com!

    And so I asked myself, how long had filext been around?

    As it turns out, quite a while! It was forked from a site called cknow around 2002. cknow.com was registered around 1996 and filext.com registered in 2001.

    The first appearance of cknow in the internet archive is late 1996: https://web.archive.org/web/19961219035827/http://www.cknow.com/ and Filext early 2001: https://web.archive.org/web/20010522235126/http://www.filext.com/

    The sites were founded by Tom Simondi. It looks like he has been responsible for a lot of the 90s and 00s work around demystifying extensions and getting more information to folk about what to do with them.

    Could it be the source of the first PRONOM records?

    Comparing some of the many other text-based lists I had found with cknow and filext gave me some confidence that there was some shared heritage between them, and so I asked: could the cknow and filext lists have also seeded PRONOM?

    I picked a list close to 2002, when PRONOM was first started (cknow Extensions: 2000), and began to compare entries for exact matches.

    ACBM Graphics | yes
    AutoCAD Compiled Menu | yes
    AutoSketch Drawing | yes
    Btrieve Database 5.1 | yes
    DataFlex Query Tag Name | yes
    Deluxe Paint bitmap | yes
    DesignCAD Drawing | yes
    Digital Video | yes
    Dr Halo Bitmap | yes
    Frame Vector Metafile | yes
    Framework Database II | yes
    Framework Database III | yes
    Framework Database IV | yes
    Information or Setup File | yes
    Inset Systems Bitmap | yes
    InterBase Database | yes
    Lotus Approach View File | yes
    Mathematica Notebook | yes
    Microsoft Excel Add-In | yes
    Microsoft Excel ODBC Query | yes
    Microsoft Excel OLAP Query | yes
    Microsoft Excel OLE DB Query | yes
    Microsoft Excel Web Query | yes
    Microsoft FoxPro Library | yes
    Microsoft Outlook Address Book | yes
    Microsoft PowerPoint Graphics File | yes
    Microsoft Powerpoint Add-In | yes
    Microsoft Visual FoxPro Table | yes
    Microsoft Works Database | yes
    Microsoft Works Document | yes
    Microstation CAD Drawing 95 | yes
    NAP Metafile | yes
    Nota Bene Text File | yes
    OS/2 Change Control File | yes
    PICS Animation | yes
    PageMaker Document 3.0 | yes
    PageMaker Time Stamp File 4.0 | yes
    Professional Write Text File | yes
    Quicken Data File | yes
    RealVideo Clip <– cc. Tyler! | yes
    Schedule+ Contacts | yes
    StatGraphics Data File | yes
    Structured Query Language Data | yes
    Ventura Publisher Vector Graphics | yes
    XYWrite Document III | yes
    XYWrite Document IV | yes

    46 matches!

    Apple Sound | maybe
    AutoCAD Device-Independent Binary Plotter File | maybe
    AutoCAD Drawing Template | maybe
    Cascading Style Sheet | maybe
    DEC Data Exchange File | maybe
    DEC WPS Plus Document | maybe
    Freelance File 1.0-2.1 | maybe
    Java Servlet Page | maybe
    Micrografx Designer 3.1 | maybe
    Microsoft Office Binder File for Windows 95 | maybe
    Microsoft Office Binder Template for Windows 95 | maybe
    Microsoft Office Binder Template for Windows 97-2003 | maybe
    Microsoft Office Binder Wizard for Windows 95 | maybe
    Microsoft Office Binder Wizard for Windows 97-2003 | maybe
    Ventura Publisher | maybe
    XYWrite Document | maybe
    XYWrite Document III+ | maybe

    17 maybes!
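    The comparison above was done by eye, but the yes/maybe split could be approximated in code. The Python sketch below is one possible approach, not the method used for this post: it treats a case-insensitive, whitespace-normalised name match as "yes" and uses difflib from the standard library to flag near misses as "maybe". The function names, the 0.8 similarity cutoff and the three-entry sample lists are all assumptions for illustration.

```python
import difflib


def normalise(name):
    """Lower-case a name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def classify(pronom_names, list_names, cutoff=0.8):
    """Label each PRONOM name 'yes', 'maybe' or 'no' against an extension list."""
    lookup = {normalise(n): n for n in list_names}
    results = {}
    for name in pronom_names:
        key = normalise(name)
        if key in lookup:
            results[name] = ("yes", lookup[key])
            continue
        # Fall back to the single closest entry above the similarity cutoff.
        close = difflib.get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
        results[name] = ("maybe", lookup[close[0]]) if close else ("no", None)
    return results


# Hypothetical stand-ins for the PRONOM outline names and the extension list.
pronom_names = ["ACBM Graphics", "XYWrite Document III+", "Revit External Group"]
list_names = ["ACBM Graphics", "XYWrite Document III"]

results = classify(pronom_names, list_names)
for name, (verdict, matched) in results.items():
    print(name, verdict, matched)
```

    Anything labelled "maybe" still needs a human look; a string-similarity score says nothing about whether two names describe the same format.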

    What did we answer?

    Okay, 46 exact matches does not the full listing make (although many (now) full entries may still have been made from these early listings). Filext may have been an important resource for the first PRONOM records, but it’s also likely that PRONOM had other sources of information. For example, a number of the Microsoft formats with outline records read like export or save-as listings in previous versions of Microsoft software. E.g. Excel:

    NB. I wasn’t actively researching this side of things while writing this blog, but I can already see some commonalities, especially Unicode Text!

    I know we also had a copy of the Dr Dobb’s Essential Books on File Formats CD-ROM in the archive, and so that may also have been an important resource when PRONOM was creating its first records.

    I count only two overlaps with the Stellent list, Framework Database III and Nota Bene Text File.

    We did, however, find the RealVideo Clip! And I think we found some decent correlation with a resource that looks likely to have been used partially to populate the PRONOM database.

    The era of file extensions

    Throughout my research, I found a lot of similar websites. Filext seems to go furthest back and has the greater pedigree, but in the noughties a lot of other sites appeared, trying to provide similar information to internet users. A few of note seemed comprehensive and particularly well presented:

    I am sure we looked at these sites during my time on PRONOM, although with less frequency given the need to reduce outline records and increase the number with actionable information.

    NB. I also learned that TrID has been around since 2003! https://web.archive.org/web/20030612031252/http://mark0.ngi.it:80/

    Provenance and prior art

    It’s not entirely productive to say I wish we had better provenance for PRONOM records back in the day – but I do!

    It makes me reflect on the importance of looking outside of our own walls in digital preservation instead of the constant redundancy of reinvention or ownership.

    Often as academics, or those with archival views of the world, we can provide a polish and precision to technology as it exists to make it more usable in an archival context.

    But cknow has been around so long, and the Unix utility File was created in 1986.

    There’s a parallel history here that we should be recognizing and sharing for our next colleagues.

    I arrived at TNA in 2009 and learned about File maybe two years later. As a Windows guy at the time, that might not be uncommon, but I do feel it is on me to have known more. I also think it should have been trivial to access the provenance around some of the records in the database at the time, but more than that – as a field, shouldn’t we all know Tom Simondi? What if the same academic rigour of PRONOM and DROID could have been applied to existing tools like File? What if we had expanded our bubble and recognized digital preservation (or the tools for it) is something people have been doing in all but name for the longest time? What if the people working in parallel on these projects and websites were part of the digital preservation inner-circle community today?

    I don’t have answers, but I feel there are lessons there for the future. Not reinventing or rebuilding without good reason is important, but even if we build something new and we have been inspired by something else, continuing to recognize and acknowledge prior art is important.

    What do you think?

    Also, how do we get these people into a room, celebrate their work, and learn more?

    What next?

    I don’t think I got very far here but I found it interesting, and I hope other readers may as well.

    This is meant to be a PRONOM hack-a-thon blog, and I don’t know if I have pushed the sticks forward that much, but maybe there’s a bit more to reason about in the outline records, for example around the plain-text formats mentioned above and a few more identified along the way.

    7-bit ANSI Text | x-fmt/21 | Recommend deprecation
    7-bit ASCII Text | x-fmt/22 | Recommend deprecation
    8-bit ANSI Text | x-fmt/282 | Recommend deprecation
    8-bit ASCII Text | x-fmt/283 | Recommend deprecation
    Unicode Text File | x-fmt/16 | Recommend deprecation
    EBCDIC-US | fmt/159 | Recommend deprecation
    MS-DOS Text File with line breaks | x-fmt/130 | Recommend deprecation

    I noticed in the outline entries some low-hanging fruit that I might focus on at the next opportunity if someone else doesn’t get there first. These would be:

    Cascading Style Sheet | x-fmt/224 | Consider adding CSS to the record name | A signature should be feasible
    Document Type Definition | x-fmt/315 | Consider adding DTD to the record name | A signature should be feasible
    Extensible Stylesheet Language | x-fmt/281 | Consider adding XSL to the record name | A signature should be feasible
    HTML Extension File | x-fmt/417 | Related to Microsoft’s IIS server | A signature may be possible
    Standard Generalized Markup Language | x-fmt/195 | Consider adding SGML to the record name | A signature may be possible
    Still Picture Interchange File Format 2.0 | fmt/113 | Related to JPEG | A signature should be possible
    Structured Query Language Data | fmt/206 | Consider adding SQL to the record name | A signature may be possible
    Dreamweaver Lock File | fmt/335 | A system file, there may be an entry in the NSRL database | A signature may be possible

    A little more on the history of extensions websites

    The complete filext text file (allext.zip)

    It took a few jumps, but I found the complete downloadable text file from Filext.com. I don’t think it exists any more and I don’t think the internet archive managed to grab a copy. Apparently it was quite a chunk of data to download on the web once upon a time, but they eventually found a way to release a zipped text file:

    Via one jump we get to the “whole list” page:

    https://web.archive.org/web/20020605164206/http://filext.com/wholelist.htm

    And then to confirm our absolute interest in downloading it, we get to the a2z file:

    https://web.archive.org/web/20020606071418/http://filext.com/a2z.htm

    That would have taken us to the zip file which, alas, was never captured by the Internet Archive. Maybe it is on other Memento-compatible servers:

    https://web.archive.org/web/20060117000000*/http://www.filext.com:80/allext.zip

    Keeping filext up to date

    Filext still asks for registry data to help keep it up to date. That’s pretty cool!

    https://filext.com/faq/gather_data_for_filext.html

    Echo OFF
    CLS
    assoc > filext_submission_output.txt
    Echo ---------- >> filext_submission_output.txt
    ftype >> filext_submission_output.txt
    Echo Thank you. The output file has been created and
    Echo named filext_submission_output.txt and it should
    Echo be in the same place where you saved this batch
    Echo file. All that is left now is to send that file
    Echo to FILExt. Attach it to an E-mail sent to the
    Echo address: [email protected]
    Echo The E-mail subject should be: Submission
    Echo Thank you.
    Pause
    Exit

    Filext as a source of learning

    The filext faqs and community seemed particularly helpful and interesting back in the day:

    https://web.archive.org/web/20090322040812/http://filext.com/faq/

    File extension aggregator

    The file-extension.net website started an aggregator project around 2007 and it’s still running today!

    http://file-extension.net/seeker/

    Some bonus images…

    As I was working on this, I found irony in Google Sheets glitching, and I managed to grab some screenshots along the way. Thanks for reading, everyone!

    #digipres #DigitalPreservation #DROID #FileFormat #FileFormats #PRONOM #WDPD #WDPD2024

  17. PRONOM’s dustiest records


    by @beet_keeper

    Tyler’s recent blog post for the PRONOM Hack-a-thon Week 2024 (my previous post for this week) brought up an interesting point about two of PRONOM’s oldest outline records, Real Video Clip (fmt/204) and Real Video (x-fmt/277). How did they end up in PRONOM?

    NB. because of the complexity of this post, it may be easier to read in original blog form, than on Mastodon here: https://exponentialdecay.co.uk/blog/pronoms-dustiest-records/

    Tyler suggests:

    I assume PRONOM originally added these based on MIME types available.

    I thought I knew the answer, but it prompted a forensic look at the records to see if what I thought I knew aligned with reality!

    Continue reading “PRONOM’s dustiest records”

    #digipres #digitalPreservation #droid #fileFormat #fileFormats #oit #oracle #outsidein #pronom #stellent #wdpd #wdpd2024

  18. PRONOM’s dustiest records

    NB. because of the complexity of this post, it may be easier to read in original blog form, than on Mastodon here: https://exponentialdecay.co.uk/blog/pronoms-dustiest-records/

    Tyler’s recent blog post for the PRONOM Hack-a-thon Week 2024 (my previous post for this week) brought up an interesting point about two of PRONOM’s oldest outline records, Real Video Clip (fmt/204) and Real Video (x-fmt/277). How did they end up in PRONOM?

    Tyler suggests:

    I assume PRONOM originally added these based on MIME types available.

    I thought I knew the answer, and so it prompted a forensic look at the records to see if what I thought I knew aligned with reality!

    As a PRONOM maintainer at The National Archives, UK, from 2009 to 2012, I knew a little bit of the history of the system. We see some of that history impact us today. For example, when we look at the number of records that don’t have descriptions or file format signatures, 156 of those records are so-called x-PUIDs: a mechanism in PRONOM for working on file formats internally without polluting the public record, and one that was never meant to make it into the wild. There are 455 x-PUIDs in total. They made it into the wild anyway (before my time), and so they exist as a symbol of PRONOM’s dustiest, oldest records.

    Even by the time I had started, PRONOM still had a lot of what we started to call outline records. One of the more positive changes we made to the process back in the day was that we would stop creating outline records; instead, we would focus on records that could be tied to signatures. This didn’t necessarily make the records more correctly aligned with reality, but it meant records had utility, and file formats identified by DROID could be tied back to something that PRONOM “knew about”. I believe the process is a bit more flexible these days, allowing individuals to contribute information to records that tie them back to information like MIME types and specifications. It’s clearer the format is “real” even if a signature is yet to be developed (and of course there are a large number of data formats that are hard to even represent in traditional PRONOM signatures any more, and so they need a record even if there isn’t a neat concept of a signature for them).

    Okay, old man, but what about Tyler’s thesis?

    Stellent and PRONOM

    I learned sometime in my tenure at The National Archives that PRONOM had been seeded with a lot of the formats listed in a technology called OutsideIn previously owned by Stellent and now owned by Oracle.

    Oracle OutsideIn: https://docs.oracle.com/outsidein/853/oit/
    OutsideIn (2010): https://web.archive.org/web/20101016164937/http://www.oracle.com/technetwork/middleware/content-management/oit-all-085236.html
    Data sheet – Formats (2011): https://web.archive.org/web/20110125024733/http://www.oracle.com/technetwork/middleware/content-management/ds-oitfiles-133032.pdf
    COPTR entry: https://coptr.digipres.org/index.php/Oracle_Outside_In_Technology

    I had always had a feeling that the scope of this list was largely exaggerated by the company selling the software, as it is a marketing tool; and if not exaggerated, then perhaps just not as clearly delineated by format as PRONOM is, but rather by software, regardless of the properties of a given “format”, e.g. WinZip and PKZip.

    Back to the story though: I was also reasonably sure I would find Tyler’s RealVideo formats in the format listing, but I did not!

    I downloaded a CSV summarizing the PRONOM records from api.pronom.ffdev.info with:

    curl -X 'GET' \
     'https://api.pronom.ffdev.info/pronom_summary_csv' \
     -H 'accept: application/csv'

    I filtered on outline entries and those without signatures only. I went through the entries still remaining and looked for name matches. I did find some name-for-name matches and some that were close, but no RealVideo or RealVideo Clip.

    The matches:

    7-bit ANSI Text | yes
    7-bit ASCII Text | yes
    8-bit ANSI Text | yes
    8-bit ASCII Text | yes
    EBCDIC-US | yes
    Framework Database III | yes
    IBM DisplayWrite Document 2 | yes
    IBM DisplayWrite Document 3 | yes
    Micrografx Designer 3.1 | yes
    Nota Bene Text File | yes
    Unicode Text File | yes

    The maybes:

    Cascading Style Sheet | maybe
    Freelance File 1.0-2.1 | maybe
    MacPaint Graphics | maybe
    Microsoft Office Binder File for Windows 95 | maybe
    Microsoft Works Database | maybe
    Microsoft Works Database for DOS 2.0 | maybe
    Microsoft Works Database for Windows 3.0 | maybe
    Microsoft Works Database for Windows 4.0 | maybe
    Professional Write Text File | maybe
    WordPerfect for Windows Document 5.2 | maybe
    XYWrite Document | maybe
    XYWrite Document III | maybe
    XYWrite Document III+ | maybe

    11 exact matches! It’s hardly a headline!

    I had hoped that if I found more exact matches it would provide some clues to where some of the older PRONOM entries came from. I expected most of the outline records to come from this list; alas, there aren’t nearly as many as I anticipated.

    I hoped too that going through the list I might get more clues as to formats that could potentially be deprecated in PRONOM.

    As it stands, from the OutsideIn list, the only records I would personally recommend for deprecation are:

    7-bit ANSI Text
    7-bit ASCII Text
    8-bit ANSI Text
    8-bit ASCII Text
    EBCDIC-US
    Unicode Text File

    We know enough now to be almost certain that if something that looks like these files arrives in the archive it will present as a standard text file, and that we will need to rely on determining the character encoding using tools such as Richard Lehane’s characterize (see characterize’s README for more background). It is unlikely we will be able to attach a signature to these records, and we know there are a great deal more encodings in the world than need be represented as PRONOM identifiers.

    NB. This might be something to formalize in a PRONOM decision-making rubric, connected also to formalizing approaches for XML-based signatures.

    A bit of a let down, or is it?

    Still uncomfortable with so many outline records and little provenance for them, I wanted to find more information about the source of PRONOM data and so I decided to take a different path — I surfed the internet for answers!

    Out of the list of outline records I found a few to be overly specific, or slightly weird, i.e. not really things we hear much about day-to-day. Some examples:

    ACBM Graphics
    Apple Sound
    AutoCAD Plot Configuration File 1.0-R13
    AutoCAD Plot Configuration File R14
    AutoSketch Drawing
    Btrieve Database 5.1
    CorelDraw Pattern
    DEC Data Exchange File
    DEC WPS Plus Document
    Dr Halo Bitmap
    Generic Library File
    HTML Extension File
    Hewlett Packard AdvanceWrite Text File
    Inkwriter/Notetaker Template
    Inset Systems Bitmap
    Instalit Script
    Interleaf Document
    Microsoft Excel Add-In
    Microsoft Excel ODBC Query
    Microsoft Excel Toolbar
    Microsoft Powerpoint Design Template
    Microsoft Print File
    Microstation CAD Drawing 95
    NAP Metafile
    Nota Bene Text File
    OS/2 Change Control File
    Revit External Group
    SAP Document
    SAS Data File
    Scanstudio 16-Colour Bitmap
    Schedule+ Contacts
    Speller Custom Dictionary
    Unisys (Sperry) System Data File
    Wordperfect Secondary File 5.0
    Wordperfect Secondary File 5.1/5.2
    form*Z Project File

    ACBM graphics? Dr Halo Bitmap? Btrieve database, “5.1”? Where are the other five?!!

    It gave me pause. I didn’t believe these were all formats well known to the folks who created PRONOM, and I know we didn’t have a digital transfer program at the time so advanced that agencies were submitting a huge variety of formats to PRONOM for future preservation.

    I felt they had to come from somewhere, but where?

    Enter Filext.com

    Because these formats were very specific I found listings on the internet that I knew had to be part of the story. I had immediate luck just looking for combinations of these names, e.g. ACBM Graphics + NAP Metafile.

    In particular I found listings on different websites from hobbyists or universities that all looked the same or similar, e.g.

    There were definite matches with PRONOM which we will get to, but I started to wonder about the provenance of these extensions.

    I kept looking and I found one clue, a header and footer of a file that looked like those above and read as follows:

    Copyright © 2002 Computer Knowledge
    All Rights Reserved
    This download for personal use only. Do NOT distribute
    it to others either alone or incorporated into any
    software without prior permission from Computer Knowledge.
    Developers who wish to incorporate portions of the list
    please see the comments at the end of this file.

    Developer permissions...

    This total file may not be included in any other software or
    project which presents the data to the public or portions of
    the public. Any developer who wishes to include up to (but
    not more than) 2,000 individual entries from this file is free
    to do so provided certain conditions are met. These are:

      1) Credit must be given to FILExt. If links are available
      in the developed product then one must also be provided to
      FILExt as http://filext.com.

      Suggested text: "File extension list courtesy of FILExt.
      For a more extensive list visit http://filext.com."

      2) Once the extensions are chosen for one product by any
      developer then these same extensions must continue to be
      used by that developer for any other projects (i.e., you
      cannot take one set of 2,000 for one project and a different
      set of 2,000 for another project; it's a total of 2,000).

      3) If links are available in the developed product then any
      links appearing associated with any of the 2,000 picked
      extensions must be included in the product. (This covers
      future plans to include such links in this list.)

    When the project is complete please notify FILExt with the
    specifics at [email protected]. We're always interested
    in how the list is being used. Thank you.

    Filext.com!

    And so I asked myself, how long had filext been around?

    As it turns out, quite a while! It was forked from a site called cknow around 2002. cknow.com was registered around 1996 and filext.com registered in 2001.

    The first appearance of cknow in the internet archive is late 1996: https://web.archive.org/web/19961219035827/http://www.cknow.com/ and Filext early 2001: https://web.archive.org/web/20010522235126/http://www.filext.com/

    The sites were founded by Tom Simondi. It looks like he has been responsible for a lot of the 90s and 00s work around demystifying extensions and getting more information to folk about what to do with them.

    Could it be the source of the first PRONOM records?

    Comparing some of the many other text-based lists I had found with cknow and filext gave me some confidence that there was some shared heritage between them, and so I asked: could the cknow and filext lists have also seeded PRONOM?

    I picked a list close to 2002, when PRONOM was first started (cknow Extensions: 2000), and began to compare entries for exact matches.

    ACBM Graphics | yes
    AutoCAD Compiled Menu | yes
    AutoSketch Drawing | yes
    Btrieve Database 5.1 | yes
    DataFlex Query Tag Name | yes
    Deluxe Paint bitmap | yes
    DesignCAD Drawing | yes
    Digital Video | yes
    Dr Halo Bitmap | yes
    Frame Vector Metafile | yes
    Framework Database II | yes
    Framework Database III | yes
    Framework Database IV | yes
    Information or Setup File | yes
    Inset Systems Bitmap | yes
    InterBase Database | yes
    Lotus Approach View File | yes
    Mathematica Notebook | yes
    Microsoft Excel Add-In | yes
    Microsoft Excel ODBC Query | yes
    Microsoft Excel OLAP Query | yes
    Microsoft Excel OLE DB Query | yes
    Microsoft Excel Web Query | yes
    Microsoft FoxPro Library | yes
    Microsoft Outlook Address Book | yes
    Microsoft PowerPoint Graphics File | yes
    Microsoft Powerpoint Add-In | yes
    Microsoft Visual FoxPro Table | yes
    Microsoft Works Database | yes
    Microsoft Works Document | yes
    Microstation CAD Drawing 95 | yes
    NAP Metafile | yes
    Nota Bene Text File | yes
    OS/2 Change Control File | yes
    PICS Animation | yes
    PageMaker Document 3.0 | yes
    PageMaker Time Stamp File 4.0 | yes
    Professional Write Text File | yes
    Quicken Data File | yes
    RealVideo Clip <– cc. Tyler! | yes
    Schedule+ Contacts | yes
    StatGraphics Data File | yes
    Structured Query Language Data | yes
    Ventura Publisher Vector Graphics | yes
    XYWrite Document III | yes
    XYWrite Document IV | yes

    46 matches!

    Apple Sound | maybe
    AutoCAD Device-Independent Binary Plotter File | maybe
    AutoCAD Drawing Template | maybe
    Cascading Style Sheet | maybe
    DEC Data Exchange File | maybe
    DEC WPS Plus Document | maybe
    Freelance File 1.0-2.1 | maybe
    Java Servlet Page | maybe
    Micrografx Designer 3.1 | maybe
    Microsoft Office Binder File for Windows 95 | maybe
    Microsoft Office Binder Template for Windows 95 | maybe
    Microsoft Office Binder Template for Windows 97-2003 | maybe
    Microsoft Office Binder Wizard for Windows 95 | maybe
    Microsoft Office Binder Wizard for Windows 97-2003 | maybe
    Ventura Publisher | maybe
    XYWrite Document | maybe
    XYWrite Document III+ | maybe

    17 maybes!

    What did we answer?

    Okay, 46 exact matches does not the full listing make (although many (now) full entries may still have been made from these early listings). Filext may have been an important resource for the first PRONOM records, but it’s also likely that PRONOM had other sources of information. For example, a number of the Microsoft formats with outline records read like export or save-as listings in previous versions of Microsoft software. E.g. Excel:

    NB. I wasn’t actively researching this side of things writing this blog, but I can already see some commonalities, especially Unicode Text!

    I know we also had a copy of the Dr Dobb’s Essential Books on File Formats CD-ROM in the archive, and so that may also have been an important resource when PRONOM was creating its first records.

    I count only two overlaps with the Stellent list, Framework Database III and Nota Bene Text File.

    We did, however, find the RealVideo Clip! And I think we found some decent correlation with a resource that looks likely to have been used partially to populate the PRONOM database.

    The era of file extensions

    Throughout my research, I found a lot of similar websites. Filext seems to go furthest back and has the greater pedigree, but in the noughties a lot of other sites appeared, trying to provide similar information to internet users. A few of note seemed comprehensive and particularly well presented:

    I am sure we looked at these sites during my time on PRONOM, although with less frequency given the need to reduce outline records and increase the number with actionable information.

    NB. I also learned that TrID has been around since 2003! https://web.archive.org/web/20030612031252/http://mark0.ngi.it:80/

    Provenance and prior art

    It’s not entirely productive to say I wish we had better provenance for PRONOM records back in the day – but I do!

    It makes me reflect on the importance of looking outside of our own walls in digital preservation instead of the constant redundancy of reinvention or ownership.

    Often as academics, or those with archival views of the world, we can provide a polish and precision to technology as it exists to make it more usable in an archival context.

    But cknow has been around so long, and the Unix utility File was created in 1986.

    There’s a parallel history here that we should be recognizing and sharing for our next colleagues.

    I arrived at TNA in 2009 and learned about File maybe two years later. As a Windows guy at the time, that might not be uncommon, but I do feel it is on me to have known more. I also think it should have been trivial to access the provenance around some of the records in the database at the time, but more than that – as a field, shouldn’t we all know Tom Simondi? What if the same academic rigour of PRONOM and DROID could have been applied to existing tools like File? What if we had expanded our bubble and recognized digital preservation (or the tools for it) is something people have been doing in all but name for the longest time? What if the people working in parallel on these projects and websites were part of the digital preservation inner-circle community today?

    I don’t have answers, but I feel there are lessons there for the future. Not reinventing or rebuilding without good reason is important, but even if we build something new and we have been inspired by something else, continuing to recognize and acknowledge prior art is important.

    What do you think?

    Also, how do we get these people into a room and celebrate their work, and learn more!

    What next?

    I don’t think I got very far here but I found it interesting, and I hope other readers may as well.

    This is meant to be a PRONOM hack-a-thon blog and I don’t know if I have pushed the sticks forward that much but maybe there’s a bit more to reason about in the outline records, for example, around the plain-text formats mentioned above and a few more identified along the way.

    7-bit ANSI Text (x-fmt/21): Recommend deprecation
    7-bit ASCII Text (x-fmt/22): Recommend deprecation
    8-bit ANSI Text (x-fmt/282): Recommend deprecation
    8-bit ASCII Text (x-fmt/283): Recommend deprecation
    Unicode Text File (x-fmt/16): Recommend deprecation
    EBCDIC-US (fmt/159): Recommend deprecation
    MS-DOS Text File with line breaks (x-fmt/130): Recommend deprecation

    I noticed some low-hanging fruit in the outline entries that I might focus on at the next opportunity, if someone else doesn’t get there first. These would be:

    Cascading Style Sheet (x-fmt/224): Consider adding CSS to the record name. A signature should be feasible.
    Document Type Definition (x-fmt/315): Consider adding DTD to the record name. A signature should be feasible.
    Extensible Stylesheet Language (x-fmt/281): Consider adding XSL to the record name. A signature should be feasible.
    HTML Extension File (x-fmt/417): Related to Microsoft’s IIS server. A signature may be possible.
    Standard Generalized Markup Language (x-fmt/195): Consider adding SGML to the record name. A signature may be possible.
    Still Picture Interchange File Format 2.0 (fmt/113): Related to JPEG. A signature should be possible.
    Structured Query Language Data (fmt/206): Consider adding SQL to the record name. A signature may be possible.
    Dreamweaver Lock File (fmt/335): A system file, there may be an entry in the NSRL database. A signature may be possible.

    A little more on the history of extensions websites

    The complete filext text file (allext.zip)

    It took a few jumps, but I found where the complete downloadable text file from Filext.com used to be offered. I don’t think the file exists any more, and I don’t think the Internet Archive managed to grab a copy. Apparently it was quite a chunk of data to download on the web once upon a time, but they eventually found a way to release a zipped text file:

    Via one jump we get to the “whole list” page:

    https://web.archive.org/web/20020605164206/http://filext.com/wholelist.htm

    And then to confirm our absolute interest in downloading it, we get to the a2z file:

    https://web.archive.org/web/20020606071418/http://filext.com/a2z.htm

    That would have taken us to the zip file which, alas, was never captured by the Internet Archive. Maybe it is on other Memento-compatible servers:

    https://web.archive.org/web/20060117000000*/http://www.filext.com:80/allext.zip

    Keeping filext up to date

    Filext still asks for registry data to help keep it up to date. That’s pretty cool!

    https://filext.com/faq/gather_data_for_filext.html

    Echo OFF
    CLS
    assoc > filext_submission_output.txt
    Echo ---------- >> filext_submission_output.txt
    ftype >> filext_submission_output.txt
    Echo Thank you. The output file has been created and
    Echo named filext_submission_output.txt and it should
    Echo be in the same place where you saved this batch
    Echo file. All that is left now is to send that file
    Echo to FILExt. Attach it to an E-mail sent to the
    Echo address: [email protected]
    Echo The E-mail subject should be: Submission
    Echo Thank you.
    Pause
    Exit

    Filext as a source of learning

    The filext faqs and community seemed particularly helpful and interesting back in the day:

    https://web.archive.org/web/20090322040812/http://filext.com/faq/

    File extension aggregator

    The file-extension.net website started an aggregator project around 2007 and it’s still running today!

    http://file-extension.net/seeker/

    Some bonus images…

    As I was working on this, I found irony in Google Sheets glitching, and I managed to grab some screenshots along the way. Thanks for reading, everyone!

    #digipres #DigitalPreservation #DROID #FileFormat #FileFormats #PRONOM #WDPD #WDPD2024

  19. PRONOM’s dustiest records

    NB. Because of the complexity of this post, it may be easier to read in its original blog form than here on Mastodon: https://exponentialdecay.co.uk/blog/pronoms-dustiest-records/

    Tyler’s recent blog post for the PRONOM Hack-a-thon Week 2024 (my previous for this week) brought up an interesting point about two of PRONOM’s oldest outline records, Real Video Clip (fmt/204) and Real Video (x-fmt/277). How did they end up in PRONOM?

    Tyler suggests:

    I assume PRONOM originally added these based on MIME types available.

    I thought I knew the answer, and so it prompted a forensic look at the records to see if what I thought I knew aligned with reality!

    As a PRONOM maintainer at The National Archives, UK, from 2009 to 2012, I knew a little bit of the history of the system. We see some of that history impact us today. For example, when we look at the records that don’t have descriptions or file format signatures, 156 of them are so-called x-PUIDs: a mechanism in PRONOM for working on file formats internally without polluting the public record, and one that was never meant to make it into the wild. There are 455 x-PUIDs in total. They made it into the wild anyway (before my time), and so they exist as a symbol of PRONOM’s dustiest, oldest records.

    Even by the time I had started, PRONOM still had a lot of what we started to call outline records. One of the more positive changes we made to the process back in the day was that we would stop creating outline records; instead, we would focus on records that could be tied to signatures. This didn’t necessarily make the records more correctly aligned with reality, but it meant records had utility and file formats identified by DROID could be tied back to something that PRONOM “knew about”. I believe the process is a bit more flexible these days, allowing individuals to contribute information to records that tie them back to information like MIMEtypes and specifications. It’s clearer the format is “real” even if a signature is yet to be developed (and of course there are a large number of data formats that are hard to even represent in traditional PRONOM signatures any more and so they need a record, even if there isn’t a neat concept of a signature for them).

    Okay old-man, but what about Tyler’s thesis?

    Stellent and PRONOM

    I learned sometime in my tenure at The National Archives that PRONOM had been seeded with a lot of the formats listed in a technology called OutsideIn previously owned by Stellent and now owned by Oracle.

    Oracle OutsideIn: https://docs.oracle.com/outsidein/853/oit/
    OutsideIn (2010): https://web.archive.org/web/20101016164937/http://www.oracle.com/technetwork/middleware/content-management/oit-all-085236.html
    Data sheet – Formats (2011): https://web.archive.org/web/20110125024733/http://www.oracle.com/technetwork/middleware/content-management/ds-oitfiles-133032.pdf
    COPTR entry: https://coptr.digipres.org/index.php/Oracle_Outside_In_Technology

    I had always had a feeling that the scope of this list was largely exaggerated by the company selling the software, as it is a marketing tool; and if not exaggerated, then perhaps just not as clearly delineated by format as PRONOM is, but rather by software, regardless of the properties of a given “format”, e.g. WinZip and PKZip.

    Back to the story though: I was also reasonably sure I would find Tyler’s RealVideo formats in the format listing, but I did not!

    I downloaded a CSV summarizing the PRONOM records from api.pronom.ffdev.info with:

    curl -X 'GET' \
     'https://api.pronom.ffdev.info/pronom_summary_csv' \
     -H 'accept: application/csv'

    I filtered on outline entries and those without signatures only. I went through the entries still remaining and looked for name matches. I did find some name-for-name matches and some that were close, but no RealVideo or RealVideo Clip.

    The matches:

    7-bit ANSI Text: yes
    7-bit ASCII Text: yes
    8-bit ANSI Text: yes
    8-bit ASCII Text: yes
    EBCDIC-US: yes
    Framework Database III: yes
    IBM DisplayWrite Document 2: yes
    IBM DisplayWrite Document 3: yes
    Micrografx Designer 3.1: yes
    Nota Bene Text File: yes
    Unicode Text File: yes

    The maybes:

    Cascading Style Sheet: maybe
    Freelance File 1.0-2.1: maybe
    MacPaint Graphics: maybe
    Microsoft Office Binder File for Windows 95: maybe
    Microsoft Works Database: maybe
    Microsoft Works Database for DOS 2.0: maybe
    Microsoft Works Database for Windows 3.0: maybe
    Microsoft Works Database for Windows 4.0: maybe
    Professional Write Text File: maybe
    WordPerfect for Windows Document 5.2: maybe
    XYWrite Document: maybe
    XYWrite Document III: maybe
    XYWrite Document III+: maybe

    11 exact matches! It’s hardly a headline!
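
    NB. The comparison above was done by eye. As a rough sketch of how a similar pass might be automated, assuming both lists are available as plain lists of names: the function names, the sample data and the 0.85 similarity threshold are illustrative assumptions, not part of any PRONOM or OutsideIn tooling, and a "maybe" produced this way would still need a human look.

```python
from difflib import SequenceMatcher


def normalise(name: str) -> str:
    """Lower-case a format name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def compare_names(pronom_names, vendor_names, maybe_threshold=0.85):
    """Sort PRONOM names into exact ("yes") and close ("maybe") matches.

    The threshold is an arbitrary, hypothetical cut-off for "close".
    """
    vendor = {normalise(n) for n in vendor_names}
    result = {"yes": [], "maybe": []}
    for name in pronom_names:
        key = normalise(name)
        if key in vendor:
            result["yes"].append(name)
        elif any(
            SequenceMatcher(None, key, v).ratio() >= maybe_threshold
            for v in vendor
        ):
            result["maybe"].append(name)
    return result


if __name__ == "__main__":
    # Tiny illustrative sample, not the real lists.
    pronom = ["Nota Bene Text File", "XYWrite Document III+", "RealVideo Clip"]
    vendor = ["nota bene text file", "XYWrite Document III", "WinZip"]
    print(compare_names(pronom, vendor))
```

    Names that fall into neither bucket (RealVideo Clip in the sample) are simply left out, which mirrors the "no RealVideo" result above.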

    I had hoped that if I found more exact matches it would provide some clues to where some of the older PRONOM entries came from. I expected most of the outline records to come from this list, alas, it isn’t nearly as many as anticipated.

    I hoped too that going through the list I might get more clues as to formats that could potentially be deprecated in PRONOM.

    As it stands, from the OutsideIn list, the only records I would personally recommend for deprecation are:

    7-bit ANSI Text
    7-bit ASCII Text
    8-bit ANSI Text
    8-bit ASCII Text
    EBCDIC-US
    Unicode Text File

    We know enough now to be almost certain that if something that looks like these files arrives in the archive it will present as a standard text file, and that we will need to rely on determining the character encoding using tools such as Richard Lehane’s characterize (see characterize’s README for more background). It is unlikely we will be able to attach a signature to these records, and we know there are a great deal more encodings in the world than need be represented as PRONOM identifiers.
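
    To make the point concrete, here is a minimal sketch of the kind of check involved. It is not how characterize works (see its README for that); it is a simple heuristic with a hypothetical function name and labels, showing that plain text carries no magic number beyond an optional byte order mark, so what we can say about it comes from inspecting the whole content.

```python
import codecs

# Byte order marks. UTF-32 LE is listed before UTF-16 LE because its BOM
# begins with the same two bytes (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def describe_text(data: bytes) -> str:
    """Return a coarse, best-effort label for the encoding of some bytes."""
    for bom, label in BOMS:
        if data.startswith(bom):
            return label
    if all(byte < 0x80 for byte in data):
        # Nothing positive to match on: only the absence of high bytes.
        return "7-bit ascii"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Could be any of the many 8-bit code pages, or EBCDIC.
        return "unknown 8-bit"


if __name__ == "__main__":
    for sample in (b"hello", "caf\u00e9".encode("utf-8"), b"caf\xe9"):
        print(sample, describe_text(sample))
```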

    NB. This might be something to formalize in a PRONOM decision-making rubric, connected also to formalizing approaches for XML-based signatures.

    A bit of a let down, or is it?

    Still uncomfortable with so many outline records and little provenance for them, I wanted to find more information about the source of PRONOM data and so I decided to take a different path — I surfed the internet for answers!

    Out of the list of outline records I found a few to be overly specific, or slightly weird, i.e. not really things we hear much about day-to-day, some examples:

    ACBM Graphics
    Apple Sound
    AutoCAD Plot Configuration File 1.0-R13
    AutoCAD Plot Configuration File R14
    AutoSketch Drawing
    Btrieve Database 5.1
    CorelDraw Pattern
    DEC Data Exchange File
    DEC WPS Plus Document
    Dr Halo Bitmap
    Generic Library File
    HTML Extension File
    Hewlett Packard AdvanceWrite Text File
    Inkwriter/Notetaker Template
    Inset Systems Bitmap
    Instalit Script
    Interleaf Document
    Microsoft Excel Add-In
    Microsoft Excel ODBC Query
    Microsoft Excel Toolbar
    Microsoft Powerpoint Design Template
    Microsoft Print File
    Microstation CAD Drawing 95
    NAP Metafile
    Nota Bene Text File
    OS/2 Change Control File
    Revit External Group
    SAP Document
    SAS Data File
    Scanstudio 16-Colour Bitmap
    Schedule+ Contacts
    Speller Custom Dictionary
    Unisys (Sperry) System Data File
    Wordperfect Secondary File 5.0
    Wordperfect Secondary File 5.1/5.2
    form*Z Project File

    ACBM graphics? Dr Halo Bitmap? Btrieve database, “5.1”? Where are the other five?!!

    It gave me pause. I didn’t believe these were all formats well-known to folks who created PRONOM, and I know we didn’t have such an advanced digital transfer program at the time that meant agencies were submitting huge variations of formats to PRONOM for future preservation.

    I felt they had to come from somewhere, but where?

    Enter Filext.com

    Because these formats were very specific I found listings on the internet that I knew had to be part of the story. I had immediate luck just looking for combinations of these names, e.g. ACBM Graphics + NAP Metafile.

    In particular I found listings on different websites from hobbyists or universities that all looked the same or similar, e.g.

    There were definite matches with PRONOM which we will get to, but I started to wonder about the provenance of these extensions.

    I kept looking and I found one clue, a header and footer of a file that looked like those above and read as follows:

    Copyright © 2002 Computer Knowledge
    All Rights Reserved
    This download for personal use only. Do NOT distribute
    it to others either alone or incorporated into any
    software without prior permission from Computer Knowledge.
    Developers who wish to incorporate portions of the list
    please see the comments at the end of this file.

    Developer permissions...

    This total file may not be included in any other software or
    project which presents the data to the public or portions of
    the public. Any developer who wishes to include up to (but
    not more than) 2,000 individual entries from this file is free
    to do so provided certain conditions are met. These are:

      1) Credit must be given to FILExt. If links are available
      in the developed product then one must also be provided to
      FILExt as http://filext.com.

      Suggested text: "File extension list courtesy of FILExt.
      For a more extensive list visit http://filext.com."

      2) Once the extensions are chosen for one product by any
      developer then these same extensions must continue to be
      used by that developer for any other projects (i.e., you
      cannot take one set of 2,000 for one project and a different
      set of 2,000 for another project; it's a total of 2,000).

      3) If links are available in the developed product then any
      links appearing associated with any of the 2,000 picked
      extensions must be included in the product. (This covers
      future plans to include such links in this list.)

    When the project is complete please notify FILExt with the
    specifics at [email protected]. We're always interested
    in how the list is being used. Thank you.

    Filext.com!

    And so I asked myself, how long had filext been around?

    As it turns out, quite a while! It was forked from a site called cknow around 2002. cknow.com was registered around 1996 and filext.com registered in 2001.

    The first appearance of cknow in the Internet Archive is late 1996: https://web.archive.org/web/19961219035827/http://www.cknow.com/ and Filext early 2001: https://web.archive.org/web/20010522235126/http://www.filext.com/
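
    The capture dates can be read straight out of those links: a Wayback Machine URL carries a 14-digit YYYYMMDDhhmmss timestamp followed by the original address. A small sketch, with a hypothetical function name, that pulls the two apart (it deliberately rejects wildcard calendar links such as those ending in `*`):

```python
from datetime import datetime

WAYBACK_PREFIX = "https://web.archive.org/web/"


def parse_wayback(url: str):
    """Split a Wayback Machine capture URL into (capture time, original URL)."""
    if not url.startswith(WAYBACK_PREFIX):
        raise ValueError("not a web.archive.org capture URL")
    stamp, _, original = url[len(WAYBACK_PREFIX):].partition("/")
    # strptime raises ValueError for anything that is not a full timestamp,
    # including wildcard stamps.
    captured = datetime.strptime(stamp, "%Y%m%d%H%M%S")
    return captured, original


if __name__ == "__main__":
    for link in (
        "https://web.archive.org/web/19961219035827/http://www.cknow.com/",
        "https://web.archive.org/web/20010522235126/http://www.filext.com/",
    ):
        when, site = parse_wayback(link)
        print(when.date(), site)
```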

    The sites were founded by Tom Simondi. It looks like he has been responsible for a lot of the 90s and 00s work around demystifying extensions and getting more information to folk about what to do with them.

    Could it be the source of the first PRONOM records?

    Comparing some of the many other text-based lists I had found with cknow and filext gave me some confidence that there was some shared heritage with them, and so I asked, could the cknow and filext lists have also seeded PRONOM?

    I picked a list close to 2002 (cknow Extensions: 2000) when PRONOM was first started and began to compare entries for exact matches.

    ACBM Graphics: yes
    AutoCAD Compiled Menu: yes
    AutoSketch Drawing: yes
    Btrieve Database 5.1: yes
    DataFlex Query Tag Name: yes
    Deluxe Paint bitmap: yes
    DesignCAD Drawing: yes
    Digital Video: yes
    Dr Halo Bitmap: yes
    Frame Vector Metafile: yes
    Framework Database II: yes
    Framework Database III: yes
    Framework Database IV: yes
    Information or Setup File: yes
    Inset Systems Bitmap: yes
    InterBase Database: yes
    Lotus Approach View File: yes
    Mathematica Notebook: yes
    Microsoft Excel Add-In: yes
    Microsoft Excel ODBC Query: yes
    Microsoft Excel OLAP Query: yes
    Microsoft Excel OLE DB Query: yes
    Microsoft Excel Web Query: yes
    Microsoft FoxPro Library: yes
    Microsoft Outlook Address Book: yes
    Microsoft PowerPoint Graphics File: yes
    Microsoft Powerpoint Add-In: yes
    Microsoft Visual FoxPro Table: yes
    Microsoft Works Database: yes
    Microsoft Works Document: yes
    Microstation CAD Drawing 95: yes
    NAP Metafile: yes
    Nota Bene Text File: yes
    OS/2 Change Control File: yes
    PICS Animation: yes
    PageMaker Document 3.0: yes
    PageMaker Time Stamp File 4.0: yes
    Professional Write Text File: yes
    Quicken Data File: yes
    RealVideo Clip: yes  <– cc. Tyler!
    Schedule+ Contacts: yes
    StatGraphics Data File: yes
    Structured Query Language Data: yes
    Ventura Publisher Vector Graphics: yes
    XYWrite Document III: yes
    XYWrite Document IV: yes

    46 matches!

    Apple Sound: maybe
    AutoCAD Device-Independent Binary Plotter File: maybe
    AutoCAD Drawing Template: maybe
    Cascading Style Sheet: maybe
    DEC Data Exchange File: maybe
    DEC WPS Plus Document: maybe
    Freelance File 1.0-2.1: maybe
    Java Servlet Page: maybe
    Micrografx Designer 3.1: maybe
    Microsoft Office Binder File for Windows 95: maybe
    Microsoft Office Binder Template for Windows 95: maybe
    Microsoft Office Binder Template for Windows 97-2003: maybe
    Microsoft Office Binder Wizard for Windows 95: maybe
    Microsoft Office Binder Wizard for Windows 97-2003: maybe
    Ventura Publisher: maybe
    XYWrite Document: maybe
    XYWrite Document III+: maybe

    17 maybes!

    What did we answer?

    Okay, 46 exact matches does not the full listing make (although many entries that are now full records may still have been created from these early listings). Filext may have been an important resource for the first PRONOM records, but it’s also likely that PRONOM had other sources of information. For example, a number of the Microsoft formats with outline records read like the export or save-as listings in previous versions of Microsoft software. E.g. Excel:

    NB. I wasn’t actively researching this side of things writing this blog, but I can already see some commonalities, especially Unicode Text!

    I know we also had a copy of the Dr Dobb’s Essential Books on File Formats CD-ROM in the archive, and so that may also have been an important resource when PRONOM was creating its first records.

    I count only two overlaps with the Stellent list, Framework Database III and Nota Bene Text File.

    We did, however, find the RealVideo Clip! And I think we found some decent correlation with a resource that looks likely to have been used partially to populate the PRONOM database.

    The era of file extensions

    Throughout my research, I found a lot of similar websites. Filext seems to go furthest back and has the greater pedigree, but in the noughties a lot of other sites appeared, trying to provide similar information to internet users. A few of note seemed comprehensive and particularly well presented:

    I am sure we looked at these sites during my time on PRONOM, although with less frequency given the need to reduce outline records and increase the number with actionable information.

    NB. I also learned that TrID has been around since 2003! https://web.archive.org/web/20030612031252/http://mark0.ngi.it:80/

    Provenance and prior art

    It’s not entirely productive to say I wish we had better provenance for PRONOM records back in the day – but I do!

    It makes me reflect on the importance of looking outside of our own walls in digital preservation instead of the constant redundancy of reinvention or ownership.

    Often as academics, or those with archival views of the world, we can provide a polish and precision to technology as it exists to make it more usable in an archival context.

    But cknow has been around so long, and the Unix utility File was created in 1986.

    There’s a parallel history here that we should be recognizing and sharing for our next colleagues.

    I arrived at TNA in 2009 and learned about File maybe two years later. As a Windows guy at the time, that might not be uncommon, but I do feel it is on me to have known more. I also think it should have been trivial to access the provenance around some of the records in the database at the time, but more than that – as a field, shouldn’t we all know Tom Simondi? What if the same academic rigour of PRONOM and DROID could have been applied to existing tools like File? What if we had expanded our bubble and recognized digital preservation (or the tools for it) is something people have been doing in all but name for the longest time? What if the people working in parallel on these projects and websites were part of the digital preservation inner-circle community today?

    I don’t have answers, but I feel there are lessons there for the future. Not reinventing or rebuilding without good reason is important, but even if we build something new and we have been inspired by something else, continuing to recognize and acknowledge prior art is important.

    What do you think?

    Also, how do we get these people into a room and celebrate their work, and learn more!

    What next?

    I don’t think I got very far here but I found it interesting, and I hope other readers may as well.

    This is meant to be a PRONOM hack-a-thon blog and I don’t know if I have pushed the sticks forward that much but maybe there’s a bit more to reason about in the outline records, for example, around the plain-text formats mentioned above and a few more identified along the way.

    7-bit ANSI Text (x-fmt/21): Recommend deprecation
    7-bit ASCII Text (x-fmt/22): Recommend deprecation
    8-bit ANSI Text (x-fmt/282): Recommend deprecation
    8-bit ASCII Text (x-fmt/283): Recommend deprecation
    Unicode Text File (x-fmt/16): Recommend deprecation
    EBCDIC-US (fmt/159): Recommend deprecation
    MS-DOS Text File with line breaks (x-fmt/130): Recommend deprecation

    I noticed some low-hanging fruit in the outline entries that I might focus on at the next opportunity, if someone else doesn’t get there first. These would be:

    Cascading Style Sheet (x-fmt/224): Consider adding CSS to the record name. A signature should be feasible.
    Document Type Definition (x-fmt/315): Consider adding DTD to the record name. A signature should be feasible.
    Extensible Stylesheet Language (x-fmt/281): Consider adding XSL to the record name. A signature should be feasible.
    HTML Extension File (x-fmt/417): Related to Microsoft’s IIS server. A signature may be possible.
    Standard Generalized Markup Language (x-fmt/195): Consider adding SGML to the record name. A signature may be possible.
    Still Picture Interchange File Format 2.0 (fmt/113): Related to JPEG. A signature should be possible.
    Structured Query Language Data (fmt/206): Consider adding SQL to the record name. A signature may be possible.
    Dreamweaver Lock File (fmt/335): A system file, there may be an entry in the NSRL database. A signature may be possible.

    A little more on the history of extensions websites

    The complete filext text file (allext.zip)

    It took a few jumps, but I found where the complete downloadable text file from Filext.com used to be offered. I don’t think the file exists any more, and I don’t think the Internet Archive managed to grab a copy. Apparently it was quite a chunk of data to download on the web once upon a time, but they eventually found a way to release a zipped text file:

    Via one jump we get to the “whole list” page:

    https://web.archive.org/web/20020605164206/http://filext.com/wholelist.htm

    And then to confirm our absolute interest in downloading it, we get to the a2z file:

    https://web.archive.org/web/20020606071418/http://filext.com/a2z.htm

    That would have taken us to the zip file which, alas, was never captured by the Internet Archive. Maybe it is on other Memento-compatible servers:

    https://web.archive.org/web/20060117000000*/http://www.filext.com:80/allext.zip

    Keeping filext up to date

    Filext still asks for registry data to help keep it up to date. That’s pretty cool!

    https://filext.com/faq/gather_data_for_filext.html

    Echo OFF
    CLS
    assoc > filext_submission_output.txt
    Echo ---------- >> filext_submission_output.txt
    ftype >> filext_submission_output.txt
    Echo Thank you. The output file has been created and
    Echo named filext_submission_output.txt and it should
    Echo be in the same place where you saved this batch
    Echo file. All that is left now is to send that file
    Echo to FILExt. Attach it to an E-mail sent to the
    Echo address: [email protected]
    Echo The E-mail subject should be: Submission
    Echo Thank you.
    Pause
    Exit
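
    For anyone curious what that submission file contains: `assoc` prints lines of the form `.ext=FileType`, and `ftype` prints `FileType=open command`, with the batch file's row of dashes in between. A minimal sketch of reading it back, assuming the file has already been decoded to text (the console code page varies) and using hypothetical function names and made-up sample data:

```python
def parse_submission(text: str):
    """Split the batch file's output into (assoc, ftype) dictionaries."""
    assoc, ftype = {}, {}
    target = assoc
    for line in text.splitlines():
        line = line.strip()
        if line and set(line) == {"-"}:
            # The "----------" separator: everything after it is ftype output.
            target = ftype
            continue
        key, sep, value = line.partition("=")
        if sep:
            target[key] = value
    return assoc, ftype


def commands_by_extension(text: str):
    """Map each extension to its open command, or None if none is listed."""
    assoc, ftype = parse_submission(text)
    return {ext: ftype.get(kind) for ext, kind in assoc.items()}


if __name__ == "__main__":
    sample = (
        ".txt=txtfile\n"
        ".log=txtfile\n"
        "---------- \n"
        "txtfile=%SystemRoot%\\system32\\NOTEPAD.EXE %1\n"
    )
    print(commands_by_extension(sample))
```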

    Filext as a source of learning

    The filext faqs and community seemed particularly helpful and interesting back in the day:

    https://web.archive.org/web/20090322040812/http://filext.com/faq/

    File extension aggregator

    The file-extension.net website started an aggregator project around 2007 and it’s still running today!

    http://file-extension.net/seeker/

    Some bonus images…

    As I was working on this, I found irony in Google Sheets glitching, and I managed to grab some screenshots along the way. Thanks for reading, everyone!

    #digipres #DigitalPreservation #DROID #FileFormat #FileFormats #PRONOM #WDPD #WDPD2024

  20. PRONOM’s dustiest records

    Tyler’s recent blog post for the PRONOM Hack-a-thon Week 2024 (my previous for this week) brought up an interesting point about two of PRONOM’s oldest outline records, Real Video Clip (fmt/204) and Real Video (x-fmt/277). How did they end up in PRONOM?

    Tyler suggests:

    I assume PRONOM originally added these based on MIME types available.

    I thought I knew the answer, and so it prompted a forensic look at the records to see if what I thought I knew aligned with reality!

    As a PRONOM maintainer at The National Archives, UK, from 2009 to 2012, I knew a little bit of the history of the system. We see some of that history impact us today. For example, when we look at the records that don’t have descriptions or file format signatures, 156 of them are so-called x-PUIDs: a mechanism in PRONOM for working on file formats internally without polluting the public record, and one that was never meant to make it into the wild. There are 455 x-PUIDs in total. They made it into the wild anyway (before my time), and so they exist as a symbol of PRONOM’s dustiest, oldest records.

    Even by the time I had started, PRONOM still had a lot of what we started to call outline records. One of the more positive changes we made to the process back in the day was that we would stop creating outline records; instead, we would focus on records that could be tied to signatures. This didn’t necessarily make the records more correctly aligned with reality, but it meant records had utility and file formats identified by DROID could be tied back to something that PRONOM “knew about”. I believe the process is a bit more flexible these days, allowing individuals to contribute information to records that tie them back to information like MIMEtypes and specifications. It’s clearer the format is “real” even if a signature is yet to be developed (and of course there are a large number of data formats that are hard to even represent in traditional PRONOM signatures any more and so they need a record, even if there isn’t a neat concept of a signature for them).

    Okay old-man, but what about Tyler’s thesis?

    Stellent and PRONOM

    I learned sometime in my tenure at The National Archives that PRONOM had been seeded with a lot of the formats listed in a technology called OutsideIn previously owned by Stellent and now owned by Oracle.

    Oracle OutsideIn: https://docs.oracle.com/outsidein/853/oit/
    OutsideIn (2010): https://web.archive.org/web/20101016164937/http://www.oracle.com/technetwork/middleware/content-management/oit-all-085236.html
    Data sheet – Formats (2011): https://web.archive.org/web/20110125024733/http://www.oracle.com/technetwork/middleware/content-management/ds-oitfiles-133032.pdf
    COPTR entry: https://coptr.digipres.org/index.php/Oracle_Outside_In_Technology

    I had always had a feeling that the scope of this list was largely exaggerated by the company selling the software, as it is a marketing tool; and if not exaggerated, then perhaps just not as clearly delineated by format as PRONOM is, but rather by software, regardless of the properties of a given “format”, e.g. WinZip and PKZip.

    Back to the story though: I was also reasonably sure I would find Tyler’s RealVideo formats in the format listing, but I did not!

    I downloaded a CSV summarizing the PRONOM records from api.pronom.ffdev.info with:

    curl -X 'GET' \
     'https://api.pronom.ffdev.info/pronom_summary_csv' \
     -H 'accept: application/csv'

    I filtered on outline entries and those without signatures only. I went through the entries still remaining and looked for name matches. I did find some name-for-name matches and some that were close, but no RealVideo or RealVideo Clip.

    The matches:

    7-bit ANSI Text: yes
    7-bit ASCII Text: yes
    8-bit ANSI Text: yes
    8-bit ASCII Text: yes
    EBCDIC-US: yes
    Framework Database III: yes
    IBM DisplayWrite Document 2: yes
    IBM DisplayWrite Document 3: yes
    Micrografx Designer 3.1: yes
    Nota Bene Text File: yes
    Unicode Text File: yes

    The maybes:

    Cascading Style Sheet: maybe
    Freelance File 1.0-2.1: maybe
    MacPaint Graphics: maybe
    Microsoft Office Binder File for Windows 95: maybe
    Microsoft Works Database: maybe
    Microsoft Works Database for DOS 2.0: maybe
    Microsoft Works Database for Windows 3.0: maybe
    Microsoft Works Database for Windows 4.0: maybe
    Professional Write Text File: maybe
    WordPerfect for Windows Document 5.2: maybe
    XYWrite Document: maybe
    XYWrite Document III: maybe
    XYWrite Document III+: maybe

    11 exact matches! It’s hardly a headline!

    I had hoped that if I found more exact matches it would provide some clues as to where some of the older PRONOM entries came from. I expected most of the outline records to come from this list; alas, it isn’t nearly as many as anticipated.

    I hoped too that going through the list I might get more clues as to formats that could potentially be deprecated in PRONOM.

    As it stands, from the OutsideIn list, the only records I would personally recommend for deprecation are:

    7-bit ANSI Text
    7-bit ASCII Text
    8-bit ANSI Text
    8-bit ASCII Text
    EBCDIC-US
    Unicode Text File

    We know enough now to be almost certain that if something that looks like these files arrives in the archive it will present as a standard text file, and that we will need to rely on determining the character encoding using tools such as Richard Lehane’s characterize (see characterize’s README for more background). It is unlikely we will be able to attach a signature to these records, and we know there are a great deal more encodings in the world than need be represented as PRONOM identifiers.
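To illustrate why a byte signature cannot carry these records, here is a minimal first-pass encoding check in Python. It is not how characterize works; it is a rough sketch of mine that only looks for byte order marks, pure 7-bit content, and valid UTF-8, and gives up on everything else.

```python
import codecs

# UTF-32 marks are checked before UTF-16 because the UTF-32 LE mark
# begins with the same two bytes as the UTF-16 LE mark.
BYTE_ORDER_MARKS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_text_encoding(data: bytes) -> str:
    """Return a very rough guess at the encoding of a plain-text file."""
    for mark, name in BYTE_ORDER_MARKS:
        if data.startswith(mark):
            return name
    if all(byte < 0x80 for byte in data):
        # Nothing here separates "7-bit ASCII" from "7-bit ANSI".
        return "ascii"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Any of the many 8-bit code pages; a real tool needs statistics.
        return "unknown-8-bit"
```

Without a byte order mark there is no fixed sequence to match on, which is why a characterization step is needed rather than a PRONOM signature.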

    NB. this might be something to formalize in a PRONOM decision making rubric, connected also, to formalizing approaches for XML based signatures.

    A bit of a let down, or is it?

    Still uncomfortable with so many outline records and little provenance for them, I wanted to find more information about the source of PRONOM data and so I decided to take a different path — I surfed the internet for answers!

    Out of the list of outline records I found a few to be overly specific, or slightly weird, i.e. not really things we hear much about day-to-day, some examples:

    ACBM Graphics
    Apple Sound
    AutoCAD Plot Configuration File 1.0-R13
    AutoCAD Plot Configuration File R14
    AutoSketch Drawing
    Btrieve Database 5.1
    CorelDraw Pattern
    DEC Data Exchange File
    DEC WPS Plus Document
    Dr Halo Bitmap
    Generic Library File
    HTML Extension File
    Hewlett Packard AdvanceWrite Text File
    Inkwriter/Notetaker Template
    Inset Systems Bitmap
    Instalit Script
    Interleaf Document
    Microsoft Excel Add-In
    Microsoft Excel ODBC Query
    Microsoft Excel Toolbar
    Microsoft Powerpoint Design Template
    Microsoft Print File
    Microstation CAD Drawing 95
    NAP Metafile
    Nota Bene Text File
    OS/2 Change Control File
    Revit External Group
    SAP Document
    SAS Data File
    Scanstudio 16-Colour Bitmap
    Schedule+ Contacts
    Speller Custom Dictionary
    Unisys (Sperry) System Data File
    Wordperfect Secondary File 5.0
    Wordperfect Secondary File 5.1/5.2
    form*Z Project File

    ACBM graphics? Dr Halo Bitmap? Btrieve database, “5.1”? where are the other five?!!

    It gave me pause. I didn’t believe these were all formats well known to the folks who created PRONOM, and I know we didn’t have a digital transfer program at the time so advanced that agencies were submitting a huge variety of formats to PRONOM for future preservation.

    I felt they had to come from somewhere, but where?

    Enter Filext.com

    Because these formats were very specific I found listings on the internet that I knew had to be part of the story. I had immediate luck just looking for combinations of these names, e.g. ACBM Graphics + NAP Metafile.

    In particular I found listings on different websites, from hobbyists or universities, that all looked the same or similar.

    There were definite matches with PRONOM which we will get to, but I started to wonder about the provenance of these extensions.

    I kept looking and I found one clue, a header and footer of a file that looked like those above and read as follows:

    Copyright © 2002 Computer Knowledge
    All Rights Reserved
    This download for personal use only. Do NOT distribute
    it to others either alone or incorporated into any
    software without prior permission from Computer Knowledge.
    Developers who wish to incorporate portions of the list
    please see the comments at the end of this file.

    Developer permissions...
    .
    This total file may not be included in any other software or
    project which presents the data to the public or portions of
    the public. Any developer who wishes to include up to (but
    not more than) 2,000 individual entries from this file is free
    to do so provided certain conditions are met. These are:
    .
      1) Credit must be given to FILExt. If links are available
      in the developed product then one must also be provided to
      FILExt as http://filext.com.
    .
      Suggested text: "File extension list courtesy of FILExt.
      For a more extensive list visit http://filext.com."
    .
      2) Once the extensions are chosen for one product by any
      developer then these same extensions must continue to be
      used by that developer for any other projects (i.e., you
      cannot take one set of 2,000 for one project and a different
      set of 2,000 for another project; it's a total of 2,000).
    .
      3) If links are available in the developed product then any
      links appearing associated with any of the 2,000 picked
      extensions must be included in the product. (This covers
      future plans to include such links in this list.)
    .
    When the project is complete please notify FILExt with the
    specifics at [email protected]. We're always interested
    in how the list is being used. Thank you.

    Filext.com!

    And so I asked myself, how long had filext been around?

    As it turns out, quite a while! It was forked from a site called cknow around 2002. cknow.com was registered around 1996 and filext.com registered in 2001.

    The first appearance of cknow in the internet archive is late 1996: https://web.archive.org/web/19961219035827/http://www.cknow.com/ and Filext early 2001: https://web.archive.org/web/20010522235126/http://www.filext.com/

    The sites were founded by Tom Simondi. It looks like he has been responsible for a lot of the 90s and 00s work around demystifying extensions and getting more information to folk about what to do with them.

    Could it be the source of the first PRONOM records?

    Comparing some of the many other text-based lists I had found with cknow and filext gave me some confidence that they shared some heritage, and so I asked, could the cknow and filext lists have also seeded PRONOM?

    I picked a list close to 2002 (cknow Extensions: 2000) when PRONOM was first started and began to compare entries for exact matches.

    ACBM Graphics: yes
    AutoCAD Compiled Menu: yes
    AutoSketch Drawing: yes
    Btrieve Database 5.1: yes
    DataFlex Query Tag Name: yes
    Deluxe Paint bitmap: yes
    DesignCAD Drawing: yes
    Digital Video: yes
    Dr Halo Bitmap: yes
    Frame Vector Metafile: yes
    Framework Database II: yes
    Framework Database III: yes
    Framework Database IV: yes
    Information or Setup File: yes
    Inset Systems Bitmap: yes
    InterBase Database: yes
    Lotus Approach View File: yes
    Mathematica Notebook: yes
    Microsoft Excel Add-In: yes
    Microsoft Excel ODBC Query: yes
    Microsoft Excel OLAP Query: yes
    Microsoft Excel OLE DB Query: yes
    Microsoft Excel Web Query: yes
    Microsoft FoxPro Library: yes
    Microsoft Outlook Address Book: yes
    Microsoft PowerPoint Graphics File: yes
    Microsoft Powerpoint Add-In: yes
    Microsoft Visual FoxPro Table: yes
    Microsoft Works Database: yes
    Microsoft Works Document: yes
    Microstation CAD Drawing 95: yes
    NAP Metafile: yes
    Nota Bene Text File: yes
    OS/2 Change Control File: yes
    PICS Animation: yes
    PageMaker Document 3.0: yes
    PageMaker Time Stamp File 4.0: yes
    Professional Write Text File: yes
    Quicken Data File: yes
    RealVideo Clip (cc. Tyler!): yes
    Schedule+ Contacts: yes
    StatGraphics Data File: yes
    Structured Query Language Data: yes
    Ventura Publisher Vector Graphics: yes
    XYWrite Document III: yes
    XYWrite Document IV: yes

    46 matches!

    Apple Sound: maybe
    AutoCAD Device-Independent Binary Plotter File: maybe
    AutoCAD Drawing Template: maybe
    Cascading Style Sheet: maybe
    DEC Data Exchange File: maybe
    DEC WPS Plus Document: maybe
    Freelance File 1.0-2.1: maybe
    Java Servlet Page: maybe
    Micrografx Designer 3.1: maybe
    Microsoft Office Binder File for Windows 95: maybe
    Microsoft Office Binder Template for Windows 95: maybe
    Microsoft Office Binder Template for Windows 97-2003: maybe
    Microsoft Office Binder Wizard for Windows 95: maybe
    Microsoft Office Binder Wizard for Windows 97-2003: maybe
    Ventura Publisher: maybe
    XYWrite Document: maybe
    XYWrite Document III+: maybe

    17 maybes!

    What did we answer?

    Okay, 46 exact matches does not the full listing make (although many (now) full entries may still have been made from these early listings). Filext may have been an important resource for the first PRONOM records, but it’s also likely that PRONOM had other sources of information. For example, a number of the Microsoft formats with outline records read like export or save-as listings in previous versions of Microsoft software, e.g. Excel.

    NB. I wasn’t actively researching this side of things writing this blog, but I can already see some commonalities, especially Unicode Text!

    I know we also had a copy of the Dr Dobb’s Essential Books on File Formats CD-ROM in the archive, and so that may also have been an important resource when PRONOM was creating its first records.

    I count only two overlaps with the Stellent list, Framework Database III and Nota Bene Text File.

    We did, however, find the RealVideo Clip! And I think we found some decent correlation with a resource that looks likely to have been used partially to populate the PRONOM database.

    The era of file extensions

    Throughout my research, I found a lot of similar websites. Filext seems to go furthest back and has the greater pedigree, but in the noughties a lot of other sites appeared that tried to provide similar information to internet users. A few of note seemed comprehensive and particularly well presented.

    I am sure we looked at these sites during my time on PRONOM, although with less frequency given the need to reduce outline records and increase the number with actionable information.

    NB. I also learned that TrID has been around since 2003! https://web.archive.org/web/20030612031252/http://mark0.ngi.it:80/

    Provenance and prior art

    It’s not entirely productive to say I wish we had better provenance for PRONOM records back in the day – but I do!

    It makes me reflect on the importance of looking outside of our own walls in digital preservation instead of the constant redundancy of reinvention or ownership.

    Often as academics, or those with archival views of the world, we can provide a polish and precision to technology as it exists to make it more usable in an archival context.

    But cknow has been around so long, and the open-source implementation of the Unix utility File dates back to 1986.

    There’s a parallel history here that we should be recognizing and sharing for our next colleagues.

    I arrived at TNA in 2009 and learned about File maybe two years later. As a Windows guy at the time, that might not be uncommon, but I do feel it is on me to have known more. I also think it should have been trivial to access the provenance around some of the records in the database at the time, but more than that – as a field, shouldn’t we all know Tom Simondi? What if the same academic rigour of PRONOM and DROID could have been applied to existing tools like File? What if we had expanded our bubble and recognized digital preservation (or the tools for it) is something people have been doing in all but name for the longest time? What if the people working in parallel on these projects and websites were part of the digital preservation inner-circle community today?

    I don’t have answers, but I feel there are lessons there for the future. Not reinventing or rebuilding without good reason is important, but even if we build something new and we have been inspired by something else, continuing to recognize and acknowledge prior art is important.

    What do you think?

    Also, how do we get these people into a room and celebrate their work, and learn more!

    What next?

    I don’t think I got very far here but I found it interesting, and I hope other readers may as well.

    This is meant to be a PRONOM hack-a-thon blog and I don’t know if I have pushed the sticks forward that much but maybe there’s a bit more to reason about in the outline records, for example, around the plain-text formats mentioned above and a few more identified along the way.

    7-bit ANSI Text (x-fmt/21): recommend deprecation
    7-bit ASCII Text (x-fmt/22): recommend deprecation
    8-bit ANSI Text (x-fmt/282): recommend deprecation
    8-bit ASCII Text (x-fmt/283): recommend deprecation
    Unicode Text File (x-fmt/16): recommend deprecation
    EBCDIC-US (fmt/159): recommend deprecation
    MS-DOS Text File with line breaks (x-fmt/130): recommend deprecation

    I noticed in the outline entries some low-hanging fruit that I might focus on next opportunity if someone else doesn’t get there first, these would be:

    Cascading Style Sheet (x-fmt/224): consider adding CSS to the record name. A signature should be feasible.
    Document Type Definition (x-fmt/315): consider adding DTD to the record name. A signature should be feasible.
    Extensible Stylesheet Language (x-fmt/281): consider adding XSL to the record name. A signature should be feasible.
    HTML Extension File (x-fmt/417): related to Microsoft’s IIS server. A signature may be possible.
    Standard Generalized Markup Language (x-fmt/195): consider adding SGML to the record name. A signature may be possible.
    Still Picture Interchange File Format 2.0 (fmt/113): related to JPEG. A signature should be possible.
    Structured Query Language Data (fmt/206): consider adding SQL to the record name. A signature may be possible.
    Dreamweaver Lock File (fmt/335): a system file, there may be an entry in the NSRL database. A signature may be possible.

    A little more on the history of extensions websites

    The complete filext text file (allext.zip)

    It took a few jumps, but I found the complete downloadable text file from Filext.com. I don’t think it exists any more and I don’t think the internet archive managed to grab a copy. Apparently it was quite a chunk of data to download on the web once upon a time, but they eventually found a way to release a zipped text file:

    Via one jump we get to the “whole list” page:

    https://web.archive.org/web/20020605164206/http://filext.com/wholelist.htm

    And then to confirm our absolute interest in downloading it, we get to the a2z file:

    https://web.archive.org/web/20020606071418/http://filext.com/a2z.htm

    Which would have taken us to the zip file. Alas, it was never captured by the Internet Archive, though maybe it is on other Memento-compatible servers:

    https://web.archive.org/web/20060117000000*/http://www.filext.com:80/allext.zip

    Keeping filext up to date

    Filext still asks for registry data to help keep it up to date. That’s pretty cool!

    https://filext.com/faq/gather_data_for_filext.html

    Echo OFF
    CLS
    assoc > filext_submission_output.txt
    Echo ---------- >> filext_submission_output.txt
    ftype >> filext_submission_output.txt
    Echo Thank you. The output file has been created and
    Echo named filext_submission_output.txt and it should
    Echo be in the same place where you saved this batch
    Echo file. All that is left now is to send that file
    Echo to FILExt. Attach it to an E-mail sent to the
    Echo address: [email protected]
    Echo The E-mail subject should be: Submission
    Echo Thank you.
    Pause
    Exit

    Filext as a source of learning

    The filext faqs and community seemed particularly helpful and interesting back in the day:

    https://web.archive.org/web/20090322040812/http://filext.com/faq/

    File extension aggregator

    The file-extension.net website started an aggregator project around 2007 and it’s still running today!

    http://file-extension.net/seeker/

    Some bonus images…

    As I was working on this, I found irony in Google Sheets glitching, and I managed to grab some screenshots along the way. Thanks for reading everyone!

    #digipres #DigitalPreservation #DROID #FileFormat #FileFormats #PRONOM #WDPD #WDPD2024

  21. @Thorsted highlighted the TCDBx database with 58,921 entries of type/creator codes (with/related to 19,737 file extensions). This is a huge number, especially compared to #pronom or even #TrID! #ipres2024 (sorry for redrafting and spamming people)

  22. simpledroid: completing the circle

    It’s nearing the end of 2024 and that must mean a PRONOM hackathon as part of the World Digital Preservation Day (#WDPD2024).

    My contribution is a follow-up on my work earlier in the year to produce a valid DROID signature file from Wikidata in wddroidy.

    simpledroid is available on GitHub and creates a simple DROID signature file from PRONOM itself, creating a scripted pathway to create a signature file using official PRONOM data that doesn’t require the current PRONOM database and its legacy stored procedures.

    It also does away with a lot of the excess data in the current DROID signature file, which was previously an optimization for its Boyer-Moore-Horspool search algorithm, as described by Matthew Palmer.

    The primary reason for simpledroid was to complete the circle on my previous efforts and to prove that it was possible to create a simplified signature file and for it to work with DROID. The result is about 80-90% there, with only a few skeleton files that remain unidentified – it should only require a small amount of forensic research to determine the reason.

    The output provides a way for simplifying the signature file generation process, offering new opportunities to create alternative versions, or filtering what’s already there, e.g. filtering out any signatures that aren’t explicitly for image identification, e.g. in a digitization workflow.

    It may provide another way into PRONOM data for those who might look at DROID first as well as opening up different ways to modify and test signatures.

    It is possible to see in the reference output, that the signatures are much easier to understand via this simplified DROID file.

    simpledroid outputs a file with a smaller footprint than the current file:

    1.2M DROID_SignatureFile_Simple_2024-11-11T12-29-22Z.xml
    3.4M DROID_SignatureFile_V118.xml

    It also contains all of the file classification data e.g. FormatType="Video" from PRONOM that will be added into DROID in a future release (and is already available in Siegfried).

    Unlike the wddroidy work, priorities have also been added to the signature file, so the mechanics of the signature file are pretty close to the official version. (DROID uses the signature sequence and offsets to identify a file, but it then uses a priority to determine which results to display to the user where there may otherwise be positive matches for formats that provide the foundation for another, e.g. how XML forms the basis of SVG or XHTML.)

    It might be possible to remove some data around minimum and maximum offsets in the new file, after discovering that the simplified DROID syntax requires curly bracket syntax at the beginning and end of sequences to mimic the same behavior, e.g.

    With a BOFoffset, min_offset = 2, and signature = BADF00D1, the signature needs to become {2}BADF00D1 to work.
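As a sketch of that rewrite in Python: the helper below folds an offset into the sequence using PRONOM's curly bracket gap syntax. The function name and arguments are hypothetical and this is not simpledroid's code. Placing the gap after the sequence for an EOF anchor, and writing a min/max pair as a `{min-max}` range, is my reading of the syntax and worth verifying against DROID.

```python
from typing import Optional


def with_offset(
    signature: str,
    reference: str,
    min_offset: int = 0,
    max_offset: Optional[int] = None,
) -> str:
    """Fold an offset into a PRONOM sequence using curly bracket syntax."""
    if max_offset is None or max_offset == min_offset:
        if min_offset == 0:
            return signature  # nothing to skip, sequence is used as-is
        gap = f"{{{min_offset}}}"  # fixed gap, e.g. {2}
    else:
        gap = f"{{{min_offset}-{max_offset}}}"  # variable gap, e.g. {0-10}
    if reference == "BOFoffset":
        return gap + signature  # skip bytes from the start of the file
    if reference == "EOFoffset":
        return signature + gap  # skip bytes before the end of the file
    raise ValueError(f"unsupported reference: {reference}")


if __name__ == "__main__":
    print(with_offset("BADF00D1", "BOFoffset", min_offset=2))
```

Run on the example above, this prints `{2}BADF00D1`.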

    The code is pretty straightforward and uses a few tricks to output XML sensibly without having to build the document’s tree (DOM) in a more verbose way. There are probably a few other shortcuts I’d fix with time if the code was ever useful, including improving variable naming and adding tests.

    I’m not sure this code will ever be needed, or used by anyone, but for a quick hack and a quick proof of concept, it felt good to put it out there. Maybe someone will look at this or the wddroidy work and see there may be a way to federate different sources of signature information together into something DROID can use. Or it might be a useful demonstration to the DROID team that allows them to simplify PRONOM’s database and output mechanisms in a way that remains compatible with existing tools.

    Previous research week work

    My previous work for PRONOM research week includes a dashboard and API for getting more information out of PRONOM, including listings of those records still requiring descriptions or signatures. You may find that work interesting and it is available at https://pronom.ffdev.info and https://api.pronom.ffdev.info.

    And if you want to get in on the signature development work, signature development utility 2.0 (https://ffdev.info) was also a previous effort of mine for research week 2020 and will hopefully also benefit from outputting DROID’s simplified syntax.

    A week of file formats

    Of course with World Digital Preservation Day, file formats were pretty popular.

    Andrew Jackson attempted to calculate how many distinct formats might be out there using methods used to calculate ecological diversity.

    Amanda Tome described the scope of their work and shared a number of useful resources including useful links to the PRONOM starter pack and to the PRONOM drop-in sessions.

    You might also find out a bit more about yourself by playing this File Format Dating Game from Lotte Wijsman and colleagues: Susanne van den Eijkel, Anton van Es, Elaine Murray, Francesca Mackenzie, Ellie O’Leary, and Sharon McMeekin. (I ended up on a date with FASTA (FDD000622) in my first play-through!)

    Not specifically for WDPD, but in the same week I also enjoyed this presentation from Ange Albertini looking at different ways of identifying file formats. One big take away for me was thinking about how to get more forensic information out of a file format identification. DROID doesn’t tell us a lot, but is there a world in which one day it could?

    Let me know if you find any of this work useful at all; and good luck on your file format endeavors this week.

    #digipres #DigitalPreservation #DROID #FileFormats #PRONOM #Python #siegfried #SkeletonTestCorpus #WDPD #WDPD2024

  23. Wikidata is a good service, Wikibase (on which Wikidata is built) is a better platform.

    I have spoken before about its potential to be added into the file-format registry ecosystem in a federated model.

    If we are to use it as a registry that can perhaps complement the pipelines going into PRONOM, e.g. in vendors’ digital preservation platforms such as the Rosetta Format Library, Wikidata should be able to output different serializations of signature file for tools such as Siegfried, DROID or FIDO.

    And what about DROID?

    Conversion to DROID

    It’s not straightforward to say to a Wikibase/Wikidata Query Service, “output XML in the shape of a DROID signature file”, but it is straightforward to write a converter script.

    I had this very thought last week while presenting with colleagues at a File Format Workshop at iPRES in Ghent.

    It dawned on me that the conversion script would actually be simple thanks to a change in format to DROID whereby it can process all its own signatures, where previously it required DROID to pre-process them. It’s a long story; the simpler version is that DROID no longer requires DROID byte-code to record information about an identification pattern, and can instead store signatures in the attribute of a byte sequence element as-is, i.e. a PRONOM formatted regular expression from PRONOM itself, or Wikidata.

    This realization resulted in my writing a conversion script (it took just over a half-day) during some down-time on the train home this past weekend.

    The script is called wddroidy (after WD-40 🙄🥁) and can be found here.

    Results

    We can see using the skeleton suite from Richard Lehane’s Builder that we can positively identify files using the new signature file.

    Links can also be made to work with Wikidata identifiers by modifying the PUID URL pattern in the DROID configuration, e.g. to:

    http://wikidata.org/entity/%s

    The screenshot below shows where in the dialog that setting is:

    Reference signature file

    A reference signature file can be found in the wddroidy repository here. There are approximately 8119 file formats listed and 8195 file format signatures for those.

    NB. We know there are different issues with Wikidata including how to identify a “format” and the quality of the signatures. We capture some of these in a global repository: https://github.com/ffdev-info/wikidp-issues/issues

    DROID simplified format

    The real headline here might be how easy it was to create the output using the DROID simplified format.

    I have spoken about it briefly before but not in any detail.

    In short, DROID no longer uses its own byte-code encoding that included strange terms such as DefaultShift, Shift Byte, and SubSequence (instructions to DROID about how to perform Boyer-Moore-Horspool search). See below and note especially how the bytes are split in Shift Byte attributes and elements:

    <?xml version="1.0" encoding="UTF-8"?>
    <FFSignatureFile xmlns="http://www.nationalarchives.gov.uk/pronom/SignatureFile" Version="1" DateCreated="2024-09-23T18:16:09+00:00">
      <InternalSignatureCollection>
        <InternalSignature ID="1" Specificity="Specific">
          <ByteSequence Reference="BOFoffset">
            <SubSequence MinFragLength="0" Position="1" SubSeqMaxOffset="0" SubSeqMinOffset="0">
              <Sequence>255044462D312E34</Sequence>
              <DefaultShift>9</DefaultShift>
              <Shift Byte="25">8</Shift>
              <Shift Byte="50">7</Shift>
              <Shift Byte="44">6</Shift>
              <Shift Byte="46">5</Shift>
              <Shift Byte="2D">4</Shift>
              <Shift Byte="31">3</Shift>
              <Shift Byte="2E">2</Shift>
              <Shift Byte="34">1</Shift>
            </SubSequence>
          </ByteSequence>
        </InternalSignature>
      </InternalSignatureCollection>
      <FileFormatCollection>
        <FileFormat ID="1" Name="Development Signature" PUID="dev/1" Version="1.0" MIMEType="application/octet-stream">
          <InternalSignatureID>1</InternalSignatureID>
          <Extension>ext</Extension>
        </FileFormat>
      </FileFormatCollection>
    </FFSignatureFile>
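The DefaultShift and Shift values in the legacy listing above follow a simple pattern: the default is the sequence length plus one, and each byte's shift is the sequence length minus its position. The Python below reproduces the table from the hex sequence alone. It is inferred from this one BOF example and is not DROID's code. Letting a later occurrence of a repeated byte overwrite an earlier one is my assumption, since the example has no repeated bytes.

```python
def shift_table(hex_sequence: str):
    """Rebuild the legacy DefaultShift and Shift Byte values for a sequence.

    Returns (default_shift, {byte_as_hex: shift}).
    """
    data = bytes.fromhex(hex_sequence)
    length = len(data)
    default_shift = length + 1  # used for bytes that are not in the sequence
    shifts = {}
    for index, byte in enumerate(data):
        # Later occurrences overwrite earlier ones (assumption).
        shifts[f"{byte:02X}"] = length - index
    return default_shift, shifts


if __name__ == "__main__":
    default, shifts = shift_table("255044462D312E34")
    print(f"<DefaultShift>{default}</DefaultShift>")
    for byte, shift in shifts.items():
        print(f'<Shift Byte="{byte}">{shift}</Shift>')
```

None of this needs to be stored in the simplified format, because the search data can be derived from the sequence when the file is loaded.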

    The updated format was made possible by Matt Palmer and his ByteSeek work, and DROID can now accept a regularly encoded PRONOM formatted regular expression (regex) in an attribute of the ByteSequence element. Here is a signature file equivalent to the legacy one above:

    <?xml version="1.0" encoding="UTF-8"?>
    <FFSignatureFile xmlns="http://www.nationalarchives.gov.uk/pronom/SignatureFile" Version="1" DateCreated="2024-09-23T18:16:09+00:00">
      <InternalSignatureCollection>
        <InternalSignature ID="1" Specificity="Specific">
          <ByteSequence Reference="BOFoffset" Sequence="255044462D312E34" Offset="0" />
        </InternalSignature>
      </InternalSignatureCollection>
      <FileFormatCollection>
        <FileFormat ID="1" Name="Development Signature" PUID="dev/1" Version="1.0" MIMEType="application/octet-stream">
          <InternalSignatureID>1</InternalSignatureID>
          <Extension>ext</Extension>
        </FileFormat>
      </FileFormatCollection>
    </FFSignatureFile>

    The format is much easier to read, and after a bit of time sitting with the DROID signature file format you realize it is fairly easy to output as well. I use some very rudimentary templates in wddroidy using Python’s f-strings.
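For a flavour of the f-string approach, here is a minimal sketch that emits a simplified signature file shaped like the example above. It is not the wddroidy code. The function name and dictionary keys are made up for illustration, and it assumes one signature per format, which real data will not always honour.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

NAMESPACE = "http://www.nationalarchives.gov.uk/pronom/SignatureFile"


def build_signature_file(formats, created):
    """Return a simplified DROID signature file as a string.

    `formats` is a list of dicts using the hypothetical keys seen below.
    """
    signatures = []
    file_formats = []
    for ident, fmt in enumerate(formats, start=1):
        # quoteattr adds the surrounding quotes and escapes the value.
        signatures.append(
            f'    <InternalSignature ID="{ident}" Specificity="Specific">\n'
            f'      <ByteSequence Reference={quoteattr(fmt["reference"])}'
            f' Sequence={quoteattr(fmt["sequence"])}'
            f' Offset="{int(fmt["offset"])}" />\n'
            f"    </InternalSignature>"
        )
        file_formats.append(
            f'    <FileFormat ID="{ident}" Name={quoteattr(fmt["name"])}'
            f' PUID={quoteattr(fmt["puid"])} Version={quoteattr(fmt["version"])}'
            f' MIMEType={quoteattr(fmt["mime"])}>\n'
            f"      <InternalSignatureID>{ident}</InternalSignatureID>\n"
            f"      <Extension>{escape(fmt['ext'])}</Extension>\n"
            f"    </FileFormat>"
        )
    signature_block = "\n".join(signatures)
    format_block = "\n".join(file_formats)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<FFSignatureFile xmlns="{NAMESPACE}" Version="1"'
        f" DateCreated={quoteattr(created)}>\n"
        "  <InternalSignatureCollection>\n"
        f"{signature_block}\n"
        "  </InternalSignatureCollection>\n"
        "  <FileFormatCollection>\n"
        f"{format_block}\n"
        "  </FileFormatCollection>\n"
        "</FFSignatureFile>\n"
    )


if __name__ == "__main__":
    example = {
        "name": "Development Signature", "puid": "dev/1", "version": "1.0",
        "mime": "application/octet-stream", "ext": "ext",
        "reference": "BOFoffset", "sequence": "255044462D312E34", "offset": 0,
    }
    document = build_signature_file([example], "2024-09-23T18:16:09+00:00")
    ET.fromstring(document.encode("utf-8"))  # raises ParseError if malformed
    print(document)
```

Templating strings is quick, but escaping is then your own responsibility, hence the `quoteattr` and `escape` calls and the parse check at the end.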

    It means other sources of PRONOM encoded signatures can output much simpler signature files and they can be used by DROID. I myself need to add it to the signature development utility – this would allow the utility to run standalone on anyone’s PC.

    One next step for this approach might be to confirm that it does work entirely as expected by extracting all of PRONOM’s signatures proper and performing a mapping to the simplified format – if we can match against all the skeleton files in the latest Builder release then we should be looking good!

    Priorities

    I am always reminded, but always forget about priorities! This is part of how DROID resolves a file format into a single identifier, e.g. where SVG can match XML, we often want the more specific format returned, and so a priority is used to prioritize that one over the other, resulting in a single unambiguous identification for the DROID user. It manifests in the signature file as:

    <FileFormat ID="634" MIMEType="image/svg+xml" Name="Scalable Vector Graphics" PUID="fmt/91" Version="1.0">
      <InternalSignatureID>24</InternalSignatureID>
      <Extension>svg</Extension>
      <HasPriorityOverFileFormatID>638</HasPriorityOverFileFormatID>
    </FileFormat>

    More work needs to be done with Wikidata to understand if priorities can be properly applied to a DROID signature file. They are not written into the reference signature file above.
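The effect of a priority can be sketched in a few lines of Python. This illustrates the idea rather than DROID's implementation: any matched format that another matched format has priority over is dropped. The IDs come from the SVG record above. That 638 is the XML record is an assumption on my part.

```python
def apply_priorities(matches, has_priority_over):
    """Drop matched formats that another matched format has priority over.

    matches: set of FileFormat IDs whose signatures matched.
    has_priority_over: dict mapping a FileFormat ID to the set of IDs
    listed in its HasPriorityOverFileFormatID elements.
    """
    suppressed = set()
    for format_id in matches:
        # Only suppress formats that actually matched.
        suppressed |= has_priority_over.get(format_id, set()) & matches
    return matches - suppressed


if __name__ == "__main__":
    priorities = {634: {638}}  # SVG (634) has priority over 638
    print(apply_priorities({634, 638}, priorities))  # only the SVG ID remains
    print(apply_priorities({638}, priorities))       # 638 alone is kept
```

Without priority data, a file matching both signatures would simply report both results, which is the situation with the Wikidata reference file.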

    Using the results

    The results can be used for two things:

    1. (Probably) There are a greater number of patterns in the Wikidata output than in PRONOM. If you have a file that remains unidentified, you can try the reference file for clues as to what it may be. I’d urge caution, though: investigate the exact byte sequence used for a match and understand its properties. I’d also check that the mapping looks accurate. I’ve tried one or two runs using the identifier and it looks good, but there may still be mistakes.
    2. For improving the quality of the sources in Wikidata. As you can see from the Skeleton suite there are a lot of gaps. We a) have a rough idea what these are, and b) know the identification doesn’t work via Wikidata. Why is that? Is the signature in Wikidata simply not good enough? Are patterns missing? Is there another error or issue we can help with given our expertise in file format identification?

    Hacking wddroidy

    You can hack wddroidy. Currently it allows you to limit the number of results returned, and also modify the ISO language code used by the tool. You can see this in the command line arguments:

    python wddroidy.py --help
    usage: wddroidy [-h] [--definitions DEFINITIONS] [--wdqs] [--lang LANG] [--limit LIMIT] [--output OUTPUT] [--output-date] [--endpoint ENDPOINT]

    create a DROID compatible signature file from Wikidata

    options:
      -h, --help            show this help message and exit
      --definitions DEFINITIONS
                            use a local definitions file, e.g. from Siegfried
      --wdqs, -w            live results from Wikidata
      --lang LANG, -l LANG  change Wikidata language results
      --limit LIMIT, -n LIMIT
                            limit the number of resukts
      --output OUTPUT, -o OUTPUT
                            filename to output to
      --output-date, -t     output a default file with the current timestamp
      --endpoint ENDPOINT, -url ENDPOINT
                            url of the WDQS

    for more information visit https://github.com/ross-spencer/wddroidy

    The actual SPARQL query used can be manually edited in the src folder. E.g. you can limit the query by format or family or classification. I provide some more inspiration in the Siegfried Wiki.

    Let me know if it’s useful!

    This is really just a quick hack and it needs a lot more testing to improve the quality of the output. Most can be dealt with on the Wikidata side I am sure, but some might need to be done in the tool. If it’s useful, reach out, and let’s discuss what can be changed or how it can be used in your work.

    Data quality

    It will quickly become apparent that the data quality isn’t what it is with PRONOM, and that is why a curated and authoritative service such as PRONOM is always going to be needed. As mentioned in previous talks, this can in theory be complemented with downstream data in federated databases. This might mean curating Wikidata better using some of the tools available, or curating data into a Wikibase (the platform Wikidata is built upon). Both options bring different benefits, such as creating a bigger tent of signature developers on Wikidata or, as another example, making more expressive signatures available via federated Wikibases.

    And a word on Wikiba.se

    A reminder too, that setting up a Wikibase can take some effort (I was once running three at the same time 😬) but a service called https://wikiba.se/ exists. wikiba.se could form an excellent scratch pad to begin thinking about mapping PRONOM-like data to a Wikibase and also begin solving some of the other issues around mapping container signatures and outputting those in a way that is compatible with DROID. Let me know if you give it a whirl, or want to collab on any of that.

    Otherwise, thanks in advance! And enjoy wddroidy!

    https://exponentialdecay.co.uk/blog/making-droid-work-with-wikidata/

    #Code #Coding #digipres #DigitalPreservation #DROID #FileFormat #FileFormats #OpenData #PRONOM #siegfried #SoftwareDevelopment #wikidata

  24. @nkrabben on AV Formats in #PRONOM. (Also discussing DROID and #Siegfried.) Example of the difference between how PRONOM looks for an mp4 signature vs how #MediaInfo identifies an mp4. #NTTW7

  25. Back by popular demand, the #PRONOM team will be running their yearly hackathon on 7th November-15th November, to celebrate #WDPD

    They will be kicking off the week with a PRONOM Open Drop-In session on the 7th dedicated to answering your questions.

    openpreservation.org/news/pron

  26. Google dorking for finding sample files is really neat - many thanks to the #PRONOM team & @beet_keeper for pointing this out!

    E.g.:
    inurl:AndroidManifest.xml filetype:xml supports-screens uses-library uses-feature

    helped me find universal #APK manifests with dependencies!

    🥳

    #digipres

  27. Shattering the eyeglass: Using Kaitai Structs to dissect the eyeglass’ contents


    by @beet_keeper

    In my post from 2012: Genesis of a File Format, I created a new file format – the Eyeglass file format. The format provides a mechanism to persist information about a patient’s eye health following a checkup at an opticians. Today in 2023 we can use the format to understand how to make use of Kaitai Structs for understanding file formats.

    Given the disclaimer that I am not actually an optician and that the format is purely illustrative, let’s look at the eyeglass again below.

    #Code #Coding #digipres #digitalLiteracy #DigitalPreservation #FileFormat #FileFormatAnalysis #FileFormats #kaitai #PRONOM #YYYY

  28. Artefactual has released v1.16.0 of its digital depot system Archivematica (and v.0.22.0 of the underlying Storage Service). This new release includes - among other things - a PRONOM update. See for the release notes: wiki.archivematica.org/Archive

    #Archieven #Archives #Archivematica #DigiPres #Pronom

  29. Hacking the DROID Signature File for Characterization


    by @beet_keeper

    Identification of a format can be approached from many angles. Often a magic number will be used at the beginning of a file. This may be strengthened by the addition of similarly consistent bytes at the end of a file or indeed any part of the bitstream in between. Using the sample file format we created last week the magic number to identify it is specified

    #characterization #digitalPreservation #droid #fileFormats #linkedData #magicNumbers #pronom #sparql

  30. simpledroid: completing the circle

    It’s nearing the end of 2024 and that must mean a PRONOM hackathon as part of the World Digital Preservation Day (#WDPD2024).

    My contribution is a follow-up on my work earlier in the year to produce a valid DROID signature file from Wikidata in wddroidy.

    simpledroid is available on GitHub and creates a simple DROID signature file from PRONOM itself, creating a scripted pathway to create a signature file using official PRONOM data that doesn’t require the current PRONOM database and its legacy stored procedures.

    It also does away with a lot of the excess data in the current DROID signature file which was previously an optimization for its Boyer Moore Horspool search algorithm, as described by Matthew Palmer.

    The primary reason for simpledroid was to complete the circle on my previous efforts and to prove that it was possible to create a simplified signature file and for it to work with DROID. The result is about 80-90% there, with only a few skeleton files that remain unidentified – it should only require a small amount of forensic research to determine the reason.

    The output provides a way of simplifying the signature file generation process, offering new opportunities to create alternative versions or to filter what’s already there, for example filtering out any signatures that aren’t explicitly for image identification in a digitization workflow.

    It may provide another way into PRONOM data for those who might look at DROID first as well as opening up different ways to modify and test signatures.

    The reference output shows that the signatures are much easier to understand in this simplified DROID file.

    simpledroid outputs a file with a smaller footprint than the current file:

    1.2M DROID_SignatureFile_Simple_2024-11-11T12-29-22Z.xml
    3.4M DROID_SignatureFile_V118.xml

    It also contains all of the file classification data e.g. FormatType="Video" from PRONOM that will be added into DROID in a future release (and is already available in Siegfried).
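    The filtering idea mentioned above could be sketched roughly as follows. This is only an illustration and not part of simpledroid: it assumes FormatType is an attribute on each FileFormat element, that a substring match on it is good enough, and the function name filter_by_format_type and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "http://www.nationalarchives.gov.uk/pronom/SignatureFile"
ET.register_namespace("", NS)  # keep the default namespace on output


def filter_by_format_type(xml_text: str, wanted: str) -> str:
    """Keep only FileFormat entries whose FormatType contains `wanted`,
    plus the internal signatures those entries still refer to."""
    root = ET.fromstring(xml_text)
    signatures = root.find(f"{{{NS}}}InternalSignatureCollection")
    formats = root.find(f"{{{NS}}}FileFormatCollection")

    kept = [f for f in formats if wanted in f.get("FormatType", "")]
    kept_format_ids = {f.get("ID") for f in kept}
    kept_sig_ids = {
        (ref.text or "").strip()
        for f in kept
        for ref in f.findall(f"{{{NS}}}InternalSignatureID")
    }

    for fmt in list(formats):
        if fmt.get("ID") not in kept_format_ids:
            formats.remove(fmt)
            continue
        # Drop priority references that would now point at removed formats.
        for ref in fmt.findall(f"{{{NS}}}HasPriorityOverFileFormatID"):
            if (ref.text or "").strip() not in kept_format_ids:
                fmt.remove(ref)

    for sig in list(signatures):
        if sig.get("ID") not in kept_sig_ids:
            signatures.remove(sig)

    return ET.tostring(root, encoding="unicode")


# Hypothetical two-format signature file used only to exercise the function.
SAMPLE = f"""<FFSignatureFile xmlns="{NS}" Version="1">
  <InternalSignatureCollection>
    <InternalSignature ID="1" Specificity="Specific">
      <ByteSequence Reference="BOFoffset" Sequence="89504E47" Offset="0" />
    </InternalSignature>
    <InternalSignature ID="2" Specificity="Specific">
      <ByteSequence Reference="BOFoffset" Sequence="1A45DFA3" Offset="0" />
    </InternalSignature>
  </InternalSignatureCollection>
  <FileFormatCollection>
    <FileFormat ID="10" Name="Example image" PUID="dev/1" FormatType="Image (Raster)">
      <InternalSignatureID>1</InternalSignatureID>
    </FileFormat>
    <FileFormat ID="20" Name="Example video" PUID="dev/2" FormatType="Video">
      <InternalSignatureID>2</InternalSignatureID>
    </FileFormat>
  </FileFormatCollection>
</FFSignatureFile>"""

images_only = filter_by_format_type(SAMPLE, "Image")
```

    Removing a format also means removing its signatures and any priority references to it, otherwise the filtered file would contain dangling identifiers.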

    Unlike the wddroidy work, priorities have also been added to the signature file, so the mechanics of the signature file are pretty close to the official version. (DROID uses the signature sequence and offsets to identify a file, but it then uses a priority to determine which results to display to the user where there may otherwise be positive matches for formats that provide the foundation for another, e.g. how XML forms the basis of SVG or XHTML.)

    It might be possible to remove some data around minimum and maximum offsets in the new file after discovering that the simplified DROID syntax requires curly bracket syntax at the beginning and end of sequences to mimic the same behavior, e.g.

    With a BOFoffset, min_offset = 2, and signature = BADF00D1, the signature needs to become {2}BADF00D1 to work.
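    As a minimal sketch of that rewrite (not the simpledroid code itself), the offset can be folded into the sequence like this. The function name apply_offset is hypothetical, and the EOFoffset and {min-max} range handling are my assumptions based on PRONOM's wildcard syntax rather than something confirmed above.

```python
from typing import Optional


def apply_offset(reference: str, sequence: str, min_offset: int = 0,
                 max_offset: Optional[int] = None) -> str:
    """Fold a min/max offset into the sequence using curly bracket syntax."""
    if max_offset is None:
        max_offset = min_offset
    if min_offset == 0 and max_offset == 0:
        return sequence  # nothing to skip, the sequence is anchored as-is
    if min_offset == max_offset:
        gap = f"{{{min_offset}}}"               # fixed gap, e.g. {2}
    else:
        gap = f"{{{min_offset}-{max_offset}}}"  # assumed range form, e.g. {2-8}
    if reference == "BOFoffset":
        return gap + sequence   # gap sits before a beginning-of-file sequence
    if reference == "EOFoffset":
        return sequence + gap   # assumed: gap sits after an end-of-file sequence
    raise ValueError(f"unsupported reference: {reference}")


print(apply_offset("BOFoffset", "BADF00D1", min_offset=2))  # {2}BADF00D1
```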

    The code is pretty straightforward and uses a few tricks to output XML sensibly without having to build the document’s tree (DOM) in a more verbose way. There are probably a few other shortcuts I’d fix with time if the code was ever useful, including improving variable naming and adding tests.

    I’m not sure this code will ever be needed, or used by anyone, but for a quick hack and a quick proof of concept, it felt good to put it out there. Maybe someone will look at this or the wddroidy work and see there may be a way to federate different sources of signature information together into something DROID can use. Or it might be a useful demonstration to the DROID team that allows them to simplify PRONOM’s database and output mechanisms in a way that remains compatible with existing tools.

    Previous research week work

    My previous work for PRONOM research week includes a dashboard and API for getting more information out of PRONOM, including listings of those records still requiring descriptions or signatures. You may find that work interesting and it is available at https://pronom.ffdev.info and https://api.pronom.ffdev.info.

    And if you want to get in on the signature development work, signature development utility 2.0 (https://ffdev.info) was also a previous effort of mine for research week 2020 and will hopefully also benefit from outputting DROID’s simplified syntax.

    A week of file formats

    Of course with World Digital Preservation Day, file formats were pretty popular.

    Andrew Jackson attempted to calculate how many distinct formats might be out there using methods used to calculate ecological diversity.

    Amanda Tome described the scope of their work and shared a number of useful resources including useful links to the PRONOM starter pack and to the PRONOM drop-in sessions.

    You might also find out a bit more about yourself by playing this File Format Dating Game from Lotte Wijsman and colleagues: Susanne van den Eijkel, Anton van Es, Elaine Murray, Francesca Mackenzie, Ellie O’Leary, and Sharon McMeekin. (I ended up on a date with FASTA (FDD000622) in my first play-through!)

    Not specifically for WDPD, but in the same week I also enjoyed this presentation from Ange Albertini looking at different ways of identifying file formats. One big take away for me was thinking about how to get more forensic information out of a file format identification. DROID doesn’t tell us a lot, but is there a world in which one day it could?

    Let me know if you find any of this work useful at all; and good luck on your file format endeavors this week.

    #digipres #DigitalPreservation #DROID #FileFormats #PRONOM #Python #siegfried #SkeletonTestCorpus #WDPD #WDPD2024

  33. Wikidata is a good service, Wikibase (on which Wikidata is built) is a better platform.

    I have spoken before about its potential to be added into the file-format registry ecosystem in a federated model.

    If we are to use it as a registry that can perhaps complement the pipelines going into PRONOM, e.g. in vendors’ digital preservation platforms such as the Rosetta Format Library, Wikidata should be able to output different serializations of a signature file for tools such as Siegfried, DROID or FIDO.

    And what about DROID?

    Conversion to DROID

    It’s not straightforward to say to a Wikibase/Wikidata Query Service, “output XML in the shape of a DROID signature file”, but it is straightforward to write a converter script.

    I had this very thought last week while presenting with colleagues at a File Format Workshop at iPRES in Ghent.

    It dawned on me that the conversion script would actually be simple thanks to a change in format to DROID whereby it can process all its own signatures, where previously it required DROID to pre-process them. It’s a long story; a simpler rendition is that DROID no longer requires DROID byte-code to record information about an identification pattern, and can instead store signatures in an attribute of a byte sequence element as-is, i.e. as a PRONOM-formatted regular expression from PRONOM itself, or from Wikidata.

    This realization resulted in my writing a conversion script (it took just over a half-day) during some down-time on the train home this past weekend.

    The script is called wddroidy (after WD-40 🙄🥁) and can be found here.

    Results

    We can see using the skeleton suite from Richard Lehane’s Builder that we can positively identify files using the new signature file.

    Links can also be made to work with Wikidata identifiers by modifying the PUID URL pattern in the DROID configuration, e.g. to:

    http://wikidata.org/entity/%s

    The screenshot below shows where in the dialog that setting is:

    Reference signature file

    A reference signature file can be found in the wddroidy repository here. There are approximately 8119 file formats listed and 8195 file format signatures for those.

    NB. We know there are different issues with Wikidata including how to identify a “format” and the quality of the signatures. We capture some of these in a global repository: https://github.com/ffdev-info/wikidp-issues/issues

    DROID simplified format

    The real headline here might be how easy it was to create the output using the DROID simplified format.

    I have spoken about it briefly before but not in any detail.

    In short, DROID no longer uses its own byte-code encoding that included strange terms such as DefaultShift, Shift Byte, and SubSequence (instructions to DROID about how to perform Boyer-Moore-Horspool search). See below and note especially how the bytes are split in Shift Byte attributes and elements:

    <?xml version="1.0" encoding="UTF-8"?>
    <FFSignatureFile xmlns="http://www.nationalarchives.gov.uk/pronom/SignatureFile" Version="1" DateCreated="2024-09-23T18:16:09+00:00">
      <InternalSignatureCollection>
        <InternalSignature ID="1" Specificity="Specific">
          <ByteSequence Reference="BOFoffset">
            <SubSequence MinFragLength="0" Position="1" SubSeqMaxOffset="0" SubSeqMinOffset="0">
              <Sequence>255044462D312E34</Sequence>
              <DefaultShift>9</DefaultShift>
              <Shift Byte="25">8</Shift>
              <Shift Byte="50">7</Shift>
              <Shift Byte="44">6</Shift>
              <Shift Byte="46">5</Shift>
              <Shift Byte="2D">4</Shift>
              <Shift Byte="31">3</Shift>
              <Shift Byte="2E">2</Shift>
              <Shift Byte="34">1</Shift>
            </SubSequence>
          </ByteSequence>
        </InternalSignature>
      </InternalSignatureCollection>
      <FileFormatCollection>
        <FileFormat ID="1" Name="Development Signature" PUID="dev/1" Version="1.0" MIMEType="application/octet-stream">
          <InternalSignatureID>1</InternalSignatureID>
          <Extension>ext</Extension>
        </FileFormat>
      </FileFormatCollection>
    </FFSignatureFile>
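    To make the Shift values in the example above less mysterious, here is a small sketch that reproduces them from the Sequence alone. It is inferred from this one beginning-of-file example only and is not DROID's actual code; the function name shift_table is hypothetical.

```python
def shift_table(hex_sequence: str):
    """Rebuild the DefaultShift / Shift Byte values for a BOF sequence.

    Inferred rule: a byte that does not occur in the pattern shifts the
    search window by len(pattern) + 1; a byte that does occur shifts it by
    its distance from the end of the pattern (last occurrence wins).
    """
    pattern = bytes.fromhex(hex_sequence)
    length = len(pattern)
    default_shift = length + 1
    shifts = {}
    for index, byte in enumerate(pattern):
        # Later occurrences overwrite earlier ones, giving the smaller shift.
        shifts[f"{byte:02X}"] = length - index
    return default_shift, shifts


default_shift, shifts = shift_table("255044462D312E34")  # "%PDF-1.4"
print(default_shift)  # 9
print(shifts)         # {'25': 8, '50': 7, '44': 6, '46': 5, '2D': 4, '31': 3, '2E': 2, '34': 1}
```

    All of this is derivable from the byte sequence itself, which is why the simplified format can leave it out and let DROID compute it.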

    The updated format was made possible by Matt Palmer’s ByteSeek work, and DROID can now accept a regularly encoded PRONOM-formatted regular expression (regex) in an attribute of the ByteSequence element. See here for a signature file equivalent to the above:

    <?xml version="1.0" encoding="UTF-8"?>
    <FFSignatureFile xmlns="http://www.nationalarchives.gov.uk/pronom/SignatureFile" Version="1" DateCreated="2024-09-23T18:16:09+00:00">
      <InternalSignatureCollection>
        <InternalSignature ID="1" Specificity="Specific">
          <ByteSequence Reference="BOFoffset" Sequence="255044462D312E34" Offset="0" />
        </InternalSignature>
      </InternalSignatureCollection>
      <FileFormatCollection>
        <FileFormat ID="1" Name="Development Signature" PUID="dev/1" Version="1.0" MIMEType="application/octet-stream">
          <InternalSignatureID>1</InternalSignatureID>
          <Extension>ext</Extension>
        </FileFormat>
      </FileFormatCollection>
    </FFSignatureFile>

    The format is much easier to read, and after a bit of time sitting with the DROID signature file format you realize it is fairly easy to output as well. I use some very rudimentary templates in wddroidy using Python’s f-strings.
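    For a flavour of what template-based output can look like, here is a minimal sketch using f-strings. It is not the wddroidy source: the function names, the dictionary keys (name, puid, ext, sequence) and the choice to give each format exactly one BOF signature are all assumptions made for illustration.

```python
from xml.sax.saxutils import escape, quoteattr

NS = "http://www.nationalarchives.gov.uk/pronom/SignatureFile"


def internal_signature(sig_id: int, sequence: str) -> str:
    # quoteattr() adds the surrounding quotes and escapes &, < and >.
    return (
        f'    <InternalSignature ID="{sig_id}" Specificity="Specific">\n'
        f'      <ByteSequence Reference="BOFoffset" Sequence={quoteattr(sequence)} Offset="0" />\n'
        f"    </InternalSignature>"
    )


def file_format(fmt_id: int, entry: dict) -> str:
    return (
        f'    <FileFormat ID="{fmt_id}" Name={quoteattr(entry["name"])} PUID={quoteattr(entry["puid"])}>\n'
        f"      <InternalSignatureID>{fmt_id}</InternalSignatureID>\n"
        f'      <Extension>{escape(entry["ext"])}</Extension>\n'
        f"    </FileFormat>"
    )


def signature_file(entries: list, date_created: str) -> str:
    """Render one signature and one format per entry, sharing the same ID."""
    sigs = "\n".join(internal_signature(i, e["sequence"]) for i, e in enumerate(entries, 1))
    fmts = "\n".join(file_format(i, e) for i, e in enumerate(entries, 1))
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<FFSignatureFile xmlns="{NS}" Version="1" DateCreated={quoteattr(date_created)}>\n'
        f"  <InternalSignatureCollection>\n{sigs}\n  </InternalSignatureCollection>\n"
        f"  <FileFormatCollection>\n{fmts}\n  </FileFormatCollection>\n"
        "</FFSignatureFile>\n"
    )


# Hypothetical entry; the PUID value is a placeholder, not a real identifier.
document = signature_file(
    [{"name": "Development Signature", "puid": "dev/1", "ext": "ext",
      "sequence": "255044462D312E34"}],
    "2024-09-23T18:16:09+00:00",
)
print(document)
```

    Escaping matters more with Wikidata than with PRONOM, because labels there can contain characters such as & that would otherwise break the XML.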

    It means other sources of PRONOM encoded signatures can output much simpler signature files and they can be used by DROID. I myself need to add it to the signature development utility – this would allow the utility to run standalone on anyone’s PC.

    One next step for this approach might be to confirm that it does work entirely as expected by extracting all of PRONOM’s signatures proper and performing a mapping to the simplified format – if we can match against all the skeleton files in the latest Builder release then we should be looking good!

    Priorities

    I am always reminded, but always forget about priorities! This is part of how DROID resolves a file format into a single identifier, e.g. where SVG can match XML, we often want the more specific format returned, and so a priority is used to prioritize that one over the other, resulting in a single unambiguous identification for the DROID user. It manifests in the signature file as:

    <FileFormat ID="634" MIMEType="image/svg+xml" Name="Scalable Vector Graphics" PUID="fmt/91" Version="1.0">
      <InternalSignatureID>24</InternalSignatureID>
      <Extension>svg</Extension>
      <HasPriorityOverFileFormatID>638</HasPriorityOverFileFormatID>
    </FileFormat>

    More work needs to be done with Wikidata to understand if priorities can be properly applied to a DROID signature file. They are not written into the reference signature file above.
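    The effect of HasPriorityOverFileFormatID can be sketched in a few lines. This is a simplified model of the idea and not DROID's implementation; the function name apply_priorities is hypothetical, and format 638 is assumed here to be the XML entry that SVG takes priority over.

```python
def apply_priorities(matches: set, priorities: dict) -> set:
    """Drop any matched format that another matched format has priority over.

    matches:    format IDs whose signatures matched the file
    priorities: format ID -> set of format IDs it has priority over
    """
    superseded = set()
    for format_id in matches:
        superseded |= priorities.get(format_id, set())
    return matches - superseded


# 634 (SVG) has priority over 638 (assumed to be XML), as in the element above.
priorities = {634: {638}}
print(apply_priorities({634, 638}, priorities))  # {634}
print(apply_priorities({638}, priorities))       # {638}
```

    Without the priority data both identifiers would be reported for an SVG file, which is the ambiguity a Wikidata-derived signature file currently leaves in place.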

    Using the results

    The results can be used for two things:

    1. (Probably) There are a greater number of patterns in the Wikidata output than in PRONOM. If you have a file that remains unidentified, you can try the reference file for clues as to what it may be. I’d use caution, though: investigate the exact byte sequence used for a match and understand its properties. I’d also check that the mapping looks accurate; I’ve tried one or two runs using the identifier and it looks good, but there may still be mistakes.
    2. For improving the quality of the sources in Wikidata. As you can see from the Skeleton suite there are a lot of gaps. We a) have a rough idea what these are, and b) know the identification doesn’t work via Wikidata. Why is that? Is the signature in Wikidata simply not good enough? Are patterns missing? Is there another error or issue we can help with given our expertise in file format identification?

    Hacking wddroidy

    You can hack wddroidy. Currently it allows you to limit the number of results returned, and also modify the ISO language code used by the tool. You can see this in the command line arguments:

    python wddroidy.py --help
    usage: wddroidy [-h] [--definitions DEFINITIONS] [--wdqs] [--lang LANG] [--limit LIMIT] [--output OUTPUT] [--output-date] [--endpoint ENDPOINT]

    create a DROID compatible signature file from Wikidata

    options:
      -h, --help                          show this help message and exit
      --definitions DEFINITIONS           use a local definitions file, e.g. from Siegfried
      --wdqs, -w                          live results from Wikidata
      --lang LANG, -l LANG                change Wikidata language results
      --limit LIMIT, -n LIMIT             limit the number of resukts
      --output OUTPUT, -o OUTPUT          filename to output to
      --output-date, -t                   output a default file with the current timestamp
      --endpoint ENDPOINT, -url ENDPOINT  url of the WDQS

    for more information visit https://github.com/ross-spencer/wddroidy

    The actual SPARQL query used can be manually edited in the src folder. E.g. you can limit the query by format or family or classification. I provide some more inspiration in the Siegfried Wiki.

    Let me know if it’s useful!

    This is really just a quick hack and it needs a lot more testing to improve the quality of the output. Most can be dealt with on the Wikidata side I am sure, but some might need to be done in the tool. If it’s useful, reach out, and let’s discuss what can be changed or how it can be used in your work.

    Data quality

    It will quickly become apparent that the data quality isn’t what it is with PRONOM, and that is why a curated and authoritative service such as PRONOM is always going to be needed. As mentioned in previous talks, this can in theory be complemented with downstream data in federated databases. This might mean curating Wikidata better using some of the tools available, or curating data into a Wikibase (the platform Wikidata is built upon). Both options bring different benefits, such as creating a bigger tent of signature developers on Wikidata or, as another example, making more expressive signatures available via federated Wikibases.

    And a word on Wikiba.se

    A reminder too that setting up a Wikibase can take some effort (I was once running three at the same time 😬), but a service called https://wikiba.se/ exists. wikiba.se could form an excellent scratch pad to begin thinking about mapping PRONOM-like data to a Wikibase, and also to begin solving some of the other issues around mapping container signatures and outputting those in a way that is compatible with DROID. Let me know if you give it a whirl, or want to collab on any of that.

    Otherwise, thanks in advance! And enjoy wddroidy!

    https://exponentialdecay.co.uk/blog/making-droid-work-with-wikidata/

    #Code #Coding #digipres #DigitalPreservation #DROID #FileFormat #FileFormats #OpenData #PRONOM #siegfried #SoftwareDevelopment #wikidata

  34. Wikidata is a good service, Wikibase (on which Wikidata is built) is a better platform.

    I have spoken before about its potential to be added into the file-format registry ecosystem in a federated model.

    If we are to use it as a registry that can perhaps complement the pipelines going into PRONOM, e.g. in vendors’ digital preservation platforms such as the Rosetta Format Library, Wikidata should be able to output different serializations of a signature file for tools such as Siegfried, DROID or FIDO.

    And what about DROID?

    Conversion to DROID

    It’s not straightforward to say to a Wikibase/Wikidata Query Service, “output XML in the shape of a DROID signature file”, but it is straightforward to write a converter script.

    I had this very thought last week while presenting with colleagues at a File Format Workshop at iPRES in Ghent.

    It dawned on me that the conversion script would actually be simple thanks to a change in format to DROID whereby it can process all its own signatures, where previously it required DROID to pre-process them. It’s a long story; a simpler rendition is that DROID no longer requires DROID byte-code to record information about an identification pattern, and can instead store signatures in an attribute of a byte sequence element as-is, i.e. as a PRONOM-formatted regular expression from PRONOM itself, or from Wikidata.

    This realization resulted in my writing a conversion script (it took just over a half-day) during some down-time on the train home this past weekend.

    The script is called wddroidy (after WD-40 🙄🥁) and can be found here.

    Results

    We can see using the skeleton suite from Richard Lehane’s Builder that we can positively identify files using the new signature file.

    Links can also be made to work with Wikidata identifiers by modifying the PUID URL pattern in the DROID configuration, e.g. to:

    http://wikidata.org/entity/%s

    The screenshot below shows where in the dialog that setting is:

    Reference signature file

    A reference signature file can be found in the wddroidy repository here. There are approximately 8119 file formats listed and 8195 file format signatures for those.

    NB. We know there are different issues with Wikidata including how to identify a “format” and the quality of the signatures. We capture some of these in a global repository: https://github.com/ffdev-info/wikidp-issues/issues

    DROID simplified format

    The real headline here might be how easy it was to create the output using the DROID simplified format.

    I have spoken about it briefly before but not in any detail.

    In short, DROID no longer uses its own byte-code encoding, which included strange terms such as DefaultShift, Shift Byte, and SubSequence (instructions to DROID about how to perform a Boyer-Moore-Horspool search). See below, and note especially how the bytes are split across Shift Byte attributes and elements:

    <?xml version="1.0" encoding="UTF-8"?>
    <FFSignatureFile xmlns="http://www.nationalarchives.gov.uk/pronom/SignatureFile" Version="1" DateCreated="2024-09-23T18:16:09+00:00">
      <InternalSignatureCollection>
        <InternalSignature ID="1" Specificity="Specific">
          <ByteSequence Reference="BOFoffset">
            <SubSequence MinFragLength="0" Position="1" SubSeqMaxOffset="0" SubSeqMinOffset="0">
              <Sequence>255044462D312E34</Sequence>
              <DefaultShift>9</DefaultShift>
              <Shift Byte="25">8</Shift>
              <Shift Byte="50">7</Shift>
              <Shift Byte="44">6</Shift>
              <Shift Byte="46">5</Shift>
              <Shift Byte="2D">4</Shift>
              <Shift Byte="31">3</Shift>
              <Shift Byte="2E">2</Shift>
              <Shift Byte="34">1</Shift>
            </SubSequence>
          </ByteSequence>
        </InternalSignature>
      </InternalSignatureCollection>
      <FileFormatCollection>
        <FileFormat ID="1" Name="Development Signature" PUID="dev/1" Version="1.0" MIMEType="application/octet-stream">
          <InternalSignatureID>1</InternalSignatureID>
          <Extension>ext</Extension>
        </FileFormat>
      </FileFormatCollection>
    </FFSignatureFile>
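    The shift values in this example follow a simple pattern: the default shift is the sequence length plus one, and each byte’s shift is the sequence length minus that byte’s position. The sketch below reproduces the numbers on that basis. It is inferred from this one example rather than taken from DROID’s source, and the function name `shift_table` is made up for illustration.

```python
def shift_table(hex_sequence):
    """Return (default_shift, {byte_hex: shift}) for a hex-encoded byte sequence.

    Assumption: the rule is inferred from the example signature file,
    not copied from DROID itself.
    """
    pattern = bytes.fromhex(hex_sequence)
    length = len(pattern)
    # A byte that does not occur in the pattern allows the largest skip.
    default_shift = length + 1
    shifts = {}
    for index, byte in enumerate(pattern):
        # A repeated byte is overwritten by its later occurrence,
        # which leaves the smaller (safe) shift in the table.
        shifts[f"{byte:02X}"] = length - index
    return default_shift, shifts


default, shifts = shift_table("255044462D312E34")  # the bytes of "%PDF-1.4"
print(default)
print(shifts)
```

    Every one of those values has to be computed and written out per signature in the old format, which is the work the simplified format removes.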

    The updated format was made possible by Matt Palmer’s ByteSeek work, and DROID can now accept a regularly encoded PRONOM formatted regular expression (regex) in an attribute of the ByteSequence element. See here for a signature file equivalent to the above:

    <?xml version="1.0" encoding="UTF-8"?>
    <FFSignatureFile xmlns="http://www.nationalarchives.gov.uk/pronom/SignatureFile" Version="1" DateCreated="2024-09-23T18:16:09+00:00">
      <InternalSignatureCollection>
        <InternalSignature ID="1" Specificity="Specific">
          <ByteSequence Reference="BOFoffset" Sequence="255044462D312E34" Offset="0" />
        </InternalSignature>
      </InternalSignatureCollection>
      <FileFormatCollection>
        <FileFormat ID="1" Name="Development Signature" PUID="dev/1" Version="1.0" MIMEType="application/octet-stream">
          <InternalSignatureID>1</InternalSignatureID>
          <Extension>ext</Extension>
        </FileFormat>
      </FileFormatCollection>
    </FFSignatureFile>

    The format is much easier to read, and after a bit of time sitting with the DROID signature file format you realize it is fairly easy to output as well. I use some very rudimentary templates in wddroidy using Python’s f-strings.
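    To give a flavour of what f-string templating for the simplified format can look like, here is a minimal sketch. It is not the code in wddroidy: the function name `build_signature_file` and the dictionary keys are hypothetical, and only the element and attribute names are taken from the example file above.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

NS = "http://www.nationalarchives.gov.uk/pronom/SignatureFile"


def build_signature_file(formats, date_created):
    """Render a simplified-format signature file from a list of dicts.

    The dict keys (name, puid, mime, extension, reference, sequence,
    offset) are a hypothetical shape chosen for this sketch.
    """
    signatures = []
    file_formats = []
    for number, fmt in enumerate(formats, start=1):
        # quoteattr() escapes the value and adds the surrounding quotes.
        signatures.append(
            f'    <InternalSignature ID="{number}" Specificity="Specific">\n'
            f'      <ByteSequence Reference={quoteattr(fmt["reference"])}'
            f' Sequence={quoteattr(fmt["sequence"])} Offset="{int(fmt["offset"])}" />\n'
            f"    </InternalSignature>\n"
        )
        file_formats.append(
            f'    <FileFormat ID="{number}" Name={quoteattr(fmt["name"])}'
            f' PUID={quoteattr(fmt["puid"])} MIMEType={quoteattr(fmt["mime"])}>\n'
            f"      <InternalSignatureID>{number}</InternalSignatureID>\n"
            f'      <Extension>{escape(fmt["extension"])}</Extension>\n'
            f"    </FileFormat>\n"
        )
    parts = [
        '<?xml version="1.0" encoding="UTF-8"?>\n',
        f'<FFSignatureFile xmlns="{NS}" Version="1" DateCreated={quoteattr(date_created)}>\n',
        "  <InternalSignatureCollection>\n",
        *signatures,
        "  </InternalSignatureCollection>\n",
        "  <FileFormatCollection>\n",
        *file_formats,
        "  </FileFormatCollection>\n",
        "</FFSignatureFile>\n",
    ]
    return "".join(parts)


example = {
    "name": "Development Signature",
    "puid": "dev/1",
    "mime": "application/octet-stream",
    "extension": "ext",
    "reference": "BOFoffset",
    "sequence": "255044462D312E34",
    "offset": 0,
}
xml_text = build_signature_file([example], "2024-09-23T18:16:09+00:00")

# Parse the result back to confirm the template produced well-formed XML.
root = ET.fromstring(xml_text.encode("utf-8"))
print(root.find(f".//{{{NS}}}ByteSequence").get("Sequence"))
```

    Escaping the values matters with a source like Wikidata, where labels can contain ampersands or quotes that would otherwise break the XML.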

    It means other sources of PRONOM-encoded signatures can output much simpler signature files that DROID can use. I myself need to add it to the signature development utility – this would allow the utility to run standalone on anyone’s PC.

    One next step for this approach might be to confirm that it does work entirely as expected by extracting all of PRONOM’s signatures proper and performing a mapping to the simplified format – if we can match against all the skeleton files in the latest Builder release then we should be looking good!

    Priorities

    I am always reminded, but always forget about priorities! This is part of how DROID resolves a file format into a single identifier, e.g. where SVG can match XML, we often want the more specific format returned, and so a priority is used to prioritize that one over the other, resulting in a single unambiguous identification for the DROID user. It manifests in the signature file as:

    <FileFormat ID="634" MIMEType="image/svg+xml" Name="Scalable Vector Graphics" PUID="fmt/91" Version="1.0">
      <InternalSignatureID>24</InternalSignatureID>
      <Extension>svg</Extension>
      <HasPriorityOverFileFormatID>638</HasPriorityOverFileFormatID>
    </FileFormat>

    More work needs to be done with Wikidata to understand if priorities can be properly applied to a DROID signature file. They are not written into the reference signature file above.
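    As a rough sketch of the idea (not DROID’s actual implementation), priority resolution can be thought of as dropping any matched format that another matched format claims priority over. The function name `resolve_priorities` is hypothetical, and the example assumes ID 638 is the XML entry that the SVG entry (ID 634) outranks.

```python
def resolve_priorities(matched_ids, priorities):
    """Drop matched formats that another matched format has priority over.

    `priorities` maps a format ID to the set of IDs it outranks, i.e. the
    values of its HasPriorityOverFileFormatID elements.
    """
    suppressed = set()
    for format_id in matched_ids:
        suppressed |= priorities.get(format_id, set())
    return [format_id for format_id in matched_ids if format_id not in suppressed]


# SVG (634) has priority over 638, assumed here to be XML.
priorities = {634: {638}}
print(resolve_priorities([638, 634], priorities))  # [634]
print(resolve_priorities([638], priorities))       # [638]
```

    Without priority data, as in the reference signature file, both matches would be returned and the user would have to pick between them.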

    Using the results

    The results can be used for two things:

    1. (Probably) There are a greater number of patterns in the Wikidata output than in PRONOM. If you have a file that remains unidentified, you can try the reference file for clues as to what it may be. I’d use caution, though: investigate the exact byte sequence used for a match and understand its properties. I’d also check that the mapping looks accurate. I’ve tried one or two runs using the identifier and it looks good, but there may still be mistakes.
    2. For improving the quality of the sources in Wikidata. As you can see from the Skeleton suite there are a lot of gaps. We a) have a rough idea what these are, and b) know the identification doesn’t work via Wikidata. Why is that? Is the signature in Wikidata simply not good enough? Are patterns missing? Is there another error or issue we can help with given our expertise in file format identification?

    Hacking wddroidy

    You can hack wddroidy. Currently it allows you to limit the number of results returned, and also modify the ISO language code used by the tool. You can see this in the command line arguments:

    python wddroidy.py --help
    usage: wddroidy [-h] [--definitions DEFINITIONS] [--wdqs] [--lang LANG] [--limit LIMIT] [--output OUTPUT] [--output-date] [--endpoint ENDPOINT]

    create a DROID compatible signature file from Wikidata

    options:
      -h, --help            show this help message and exit
      --definitions DEFINITIONS
                            use a local definitions file, e.g. from Siegfried
      --wdqs, -w            live results from Wikidata
      --lang LANG, -l LANG  change Wikidata language results
      --limit LIMIT, -n LIMIT
                            limit the number of resukts
      --output OUTPUT, -o OUTPUT
                            filename to output to
      --output-date, -t     output a default file with the current timestamp
      --endpoint ENDPOINT, -url ENDPOINT
                            url of the WDQS

    for more information visit https://github.com/ross-spencer/wddroidy

    The actual SPARQL query used can be manually edited in the src folder. E.g. you can limit the query by format or family or classification. I provide some more inspiration in the Siegfried Wiki.
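    For a sense of what narrowing the query could involve, here is a small sketch that builds a SPARQL string with a language and an optional limit. It is not the query shipped in wddroidy’s src folder. The function name `build_query` is hypothetical, and the identifiers used (Q235557 for “file format”, P4152 for “file format identification pattern”) are my assumptions and should be checked against Wikidata before use.

```python
def build_query(lang="en", limit=None):
    """Build a SPARQL query for file formats that carry a signature pattern.

    Assumptions: wd:Q235557 is "file format" and wdt:P4152 is
    "file format identification pattern" on Wikidata.
    """
    query = f"""SELECT DISTINCT ?format ?formatLabel ?pattern WHERE {{
  ?format wdt:P31/wdt:P279* wd:Q235557 .
  ?format wdt:P4152 ?pattern .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{lang}" . }}
}}"""
    if limit is not None:
        # int() guards against anything other than a number ending up in the query.
        query += f"\nLIMIT {int(limit)}"
    return query


print(build_query(lang="de", limit=10))
```

    Swapping wd:Q235557 for a narrower class would restrict the output to one family of formats.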

    Let me know if it’s useful!

    This is really just a quick hack and it needs a lot more testing to improve the quality of the output. Most can be dealt with on the Wikidata side I am sure, but some might need to be done in the tool. If it’s useful, reach out, and let’s discuss what can be changed or how it can be used in your work.

    Data quality

    It will quickly become apparent that the data quality isn’t what it is with PRONOM, and that is why a curated and authoritative service such as PRONOM is always going to be needed. As mentioned in previous talks, this can in theory be complemented with downstream data in federated databases. This might mean curating Wikidata better using some of the tools available, or curating data into a Wikibase (the platform Wikidata is built upon). Both options bring different benefits, such as creating a bigger tent of signature developers on Wikidata or, as another example, more expressive signatures being made available via federated Wikibases.

    And a word on Wikiba.se

    A reminder, too, that setting up a Wikibase can take some effort (I was once running three at the same time 😬), but a service called https://wikiba.se/ exists. wikiba.se could form an excellent scratch pad to begin thinking about mapping PRONOM-like data to a Wikibase, and also to begin solving some of the other issues around mapping container signatures and outputting those in a way that is compatible with DROID. Let me know if you give it a whirl, or want to collab on any of that.

    Otherwise, thanks in advance! And enjoy wddroidy!

    https://exponentialdecay.co.uk/blog/making-droid-work-with-wikidata/

    #Code #Coding #digipres #DigitalPreservation #DROID #FileFormat #FileFormats #OpenData #PRONOM #siegfried #SoftwareDevelopment #wikidata

  36. What information is in a file format identification report?


    by @beet_keeper

    In early 2022, I was finally able to get around to writing a paper that I had been thinking about for the better part of a decade. The paper, “Fractal in Detail: What Information Is in a File Format Identification Report?” was published in the Code4Lib journal Issue 53.

    The paper takes a deep dive into the fractal contents of file format identification reports exported from tools like Siegfried and DROID.

    Let’s take a brief look at the article and its contents below.

    Continue reading “What information is in a file format identification report?”

    #code4lib #code4libJournal #digipres #digitalPreservation #droid #fileFormatAnalysis #fileFormatIdentification #fileFormats #filedriller #formatIdentification #freud #linting #metadata #preservationMetadata #pronom #puid #puids #siegfried #staticAnalysis #technicalMetadata

  37. simpledroid: completing the loop


    by @beet_keeper

    It’s nearing the end of 2024 and that must mean a PRONOM hackathon as part of the World Digital Preservation Day (#WDPD2024).

    My contribution is a follow-up on my work earlier in the year to produce a valid DROID signature file from Wikidata in wddroidy.

    Continue reading “simpledroid: completing the loop”

    #digipres #DigitalPreservation #DROID #FileFormats #PRONOM #Python #siegfried #SkeletonTestCorpus #WDPD #WDPD2024