home.social

#docfs — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #docfs, aggregated by home.social.

  1. @carapace Some of the most interesting ideas I've seen are in the Plan 9 OS and specifically its 9P protocol:

    en.wikipedia.org/wiki/Plan_9_f

    en.wikipedia.org/wiki/9P_(prot

    That includes a /webfs concept where remote networked resources are accessible via filesystem semantics. That's a concept that's been adopted to some extent on other operating systems, notably Sun Solaris and its ability to automount NFS shares (something I've seen ... abused rather heavily in some shops), and in some Linux filesystems, largely using FUSE (Filesystem in USErspace) en.wikipedia.org/wiki/Filesyst.

    I'll note that when you're on the systems side of things it's quite helpful to have canonical and invariant names for data resources. Mixing and matching this with a documents-oriented filesystem might not lead to happy places.

    #Filesystems #webfs #docfs

    5/end/

  2. @carapace One notion I'd arrived at was that in the case of catalogue access, search is identity.

    That is, a search which turns up a single record or document is definitionally an identity for that document.

    That identity might be a standard assigned value, such as an ISBN, DOI, or Library of Congress call number, or it could be some distinct set of parameters, say, a combination of author, title, and publication date, which return a single record.

    Note that a search which is an identity in one archive or at one point in time might not be an identity for another: an identity returns a single record, whereas a search might return several, one, or no records.

    One notion I have is of using a filesystem-like syntax for search, so that, say, /docfs/au:steinbeck/ti:grapes might turn up records related to John Steinbeck's Grapes of Wrath. Here, /docfs is a virtual filesystem which provides an interface into the documents filesystem. Specific assigned identifiers might be referenced as /docfs/id:isbn:0330881043 (again: Steinbeck's Grapes of Wrath).
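
    A minimal sketch of how such a path might be read as a query, assuming a toy in-memory catalogue. The au:, ti: and id: prefixes come from the examples above; the function names, the record layout, the placeholder identifiers, and the case-insensitive substring matching are all assumptions made for illustration, not an existing implementation.

```python
from typing import Dict, List

# Toy catalogue. The records and the "local:" identifiers are placeholders.
CATALOGUE: List[Dict[str, str]] = [
    {"au": "John Steinbeck", "ti": "The Grapes of Wrath", "id": "isbn:0330881043"},
    {"au": "John Steinbeck", "ti": "East of Eden", "id": "local:0002"},
    {"au": "Paul Otlet", "ti": "Traite de documentation", "id": "local:0003"},
]


def parse_path(path: str) -> Dict[str, str]:
    """Turn '/docfs/au:steinbeck/ti:grapes' into {'au': 'steinbeck', 'ti': 'grapes'}."""
    parts = [p for p in path.strip("/").split("/") if p]
    if not parts or parts[0] != "docfs":
        raise ValueError("path must start with /docfs")
    query: Dict[str, str] = {}
    for part in parts[1:]:
        # Split on the first colon only, so 'id:isbn:0330881043' keeps 'isbn:0330881043'.
        field, sep, value = part.partition(":")
        if not sep or not value:
            raise ValueError(f"expected field:value, got {part!r}")
        query[field] = value
    return query


def search(path: str) -> List[Dict[str, str]]:
    """Return every record in which each queried field contains its value."""
    query = parse_path(path)
    return [
        rec for rec in CATALOGUE
        if all(value.lower() in rec.get(field, "").lower()
               for field, value in query.items())
    ]
```

    With this toy data, /docfs/au:steinbeck returns two records (a set), while adding /ti:grapes or using the id: form narrows it to one (an identity). Against a different catalogue the same paths could return none or several.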

    #Filesystems #webfs #docfs

    4/

  3. @carapace I've put a fair bit of thought into how a document-oriented filesystem (in the #PaulOtlet sense of "document") might function. To the extent I've thought this out, it's somewhat modelled on how a physical library is organised: there is the actual storage ("stacks"), and there's the interface to that storage ("catalogue").

    A document is any contained information. It might be a text, image, sound, video, multimedia, or data record, or combination of these.

    The stacks contain works. The catalogue provides ways of accessing those works, and any given work might appear in or be accessed through the catalogue in any number of different ways.

    A huge challenge for any such metadata-based system is that metadata itself requires design and creation, and this remains hugely cumbersome for data presently. There's some useful metadata associated with filesystems, though much of that is at a systems rather than document level, and some metadata (say, file creation / modify / access timestamps) bears little if any relation to the underlying document. Tracking document-related metadata would be a huge step forward.

    Relying on extant and often external metadata would also be useful. Library of Congress, OCLC, IMDB, CDDB, DOI, ISBN, and related records would be quite useful for classifying existing works. Some set of useful standards for other common records (personal computer files, system logs, receipts, memos, correspondence, online interactions) might also be useful. The more that metadata creation can be both automated and made useful (and no, "New Document" is not a useful title) the better.

    #Filesystems #webfs #docfs

    3/

  4. @carapace The problems with replacing the classic hierarchical filesystem are much the same as swapping out any other piece of well-established standards:

    1. You've got to have an exceptionally compelling alternative.

    2. There's a hell of a lot of legacy that relies on extant systems.

    3. Agreeing on a specific replacement (or set of replacements) creates its own huge coordination problem.

    I'd be interested in hearing how you're addressing each of those points.

    #Filesystems #webfs #docfs

    2/

  5. @carapace One question I'd toss out is: where did the notion of hierarchical filesystems first emerge?

    I'm familiar with Linux / Unix, a whole slew of PC-based systems (DOS, CMS, Classic Mac), as well as MVS (TSO/ISPF) and VMS. Linux is certainly where I feel most at home.

    IBM mainframes (MVS) had a one-level hierarchy: effectively, you could create any number of folders at the root filesystem level, and place files in those, but nested files weren't A Thing. I suspect that in this regard, IBM was trying to emulate paper-based filing systems where cabinets held folders and folders held individual records, but nesting was distinctly limited.

    Nested filesystems may date to Multics if GPT is to be trusted. Wikipedia supports this: en.wikipedia.org/wiki/File_sys

    #Filesystems #webfs #docfs

    1/

  6. @RussSharek Ayup. I'm headed that way.

    One of my recent finds that's been game-changing has been "Save as ePub", a feature of the #Einkbro browser (Android). That's a fork of the FOSS Browser, which might have similar functionality.

    Effectively, you can save a Web article as an ePub, or append it to an existing ePub, which means you can effectively "build your own book" of relevant content (a project, good articles over a specific time period, work-related project, stuff to share with someone else). For tablets / mobile devices this is about the best option I've found, preferable to saving PDFs, with the one exception that most metadata concerning the saved content is lost. I'm not sure the source URL is kept; the date is certainly lost.

    The #webfs and #docfs tags in my first toot above refer to a project I've been kicking around for managing documents and articles, both Web and otherwise. I'm tending strongly toward a plain-text baseline format (with markup languages such as Markdown, LaTeX, djot, etc., being ways of extending basic structure and capabilities), also with extensive bibliographic metadata. It's all pretty much vapourware but it's fun to think about.

  7. So, Pocket, the article-archival tool that keeps getting worse the more you use it, has just become immeasurably worse.

    I've reverted from version 8.6.x to no, not 8.5, not 8.4, not 8.2, but 8.1.1.0 from freaking February of this year to revert these completely fucking brain-dead changes.

    The TL;DR: link is apkmirror.com/apk/mozilla-corp

    That's what you want to install and freeze on until Pocket catches a motherfucking clue.

    I've had a long and unhappy relationship with this feature and app. Its sole claims to my continued use are that it holds nearly 5 GB of content hostage, and that it, unbelievably, seems to be the best of what is an immensely shitty application space. See my now-six-year-old rant virtually all of which remains valid: web.archive.org/web/2019051209

    Most recently, Pocket has lost two features:

    • A "page flip" mode, which though itself hugely flawed, is better than scrolling through articles, especially on e-ink devices.

    • The ability to view all articles either in the (hugely preferable, very useful) #ReadabilityJS view, or in-app in a "web view". The latter now revert to your device's default Web Browser app on mobile devices.

    The problem with that latter is that the task of annotating and tagging articles (my principal remaining justification for Pocket) is made vastly more tedious --- and it's already more than adequately tedious in previous Pocket versions. To the point it's not even worthwhile.

    Fortunately, I was able to hunt down a prior version of the app (using the APKMirror app), and I will not be upgrading Pocket beyond the most recent version I can find which still supports both Page Flip and Web View modes: as noted above, 8.1.1.0 from 17 February 2023. (Few if any of Pocket's "improvements" over the past five years have had any value to me whatsoever, so this is little loss.)

    There is of course a Relevant xkcd: "Software Updates":

    xkcd.com/2224/

    I would so like to see a useful document-management solution for tablets and e-ink devices with the ability to manage both offline and online (Web-based) content.

    Boosting and re-sharing this on other platforms is strongly encouraged.

    Edits: I'm updating this toot as I'm finding out more. In particular, what version(s) of Pocket are NOT affected by these changes is not yet clear.

    #Pocket #GetPocket #MozillaPocket #Mozilla #ApkMirror #EInk #DocumentManagement #xkcd #xkcd2224 #kfc #webfs #docfs

  8. Whitespace in filenames is a major category error IMO.

    OTOH, filenames themselves (and filesystems as presently incarnated) are also grossly insufficient for many needs. It's interesting to note, for example, that on Android (and possibly iOS), databases (usually sqlite) have emerged as the de-facto default persistent data storage mechanism, even for content which would normally be held on a filesystem.

    I've long been looking at questions such as what a document-oriented filesystem (#docFS) or the World Wide Web as filesystem-accessible (#webFS) might look like.

    For documents, I've generally arrived at a naming standard which uses underbars (_) to separate elements, hyphens (-) for standard whitespace, and double dashes (--) to indicate punctuated / multiple elements (e.g., multiple authors, or a subtitle following a colon or dash). Permitted characters are otherwise 7-bit ASCII alphanumeric ([A-Za-z0-9]), with dot as a file extension only, and possibly parentheses.

    So:

    Author-One--Author-Two_Title--Subtitle_YYYY.filetype

    That might have a publisher or journal title added (an additional underbar-delimited element after the title(s)). Additional contributors (e.g., editors, translators) might be mentioned. And it's possible some identifier (ISBN, OCLC, DOI, LoC call number) might be added, though those are supplemental.
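
    A rough sketch of that naming scheme, covering only authors, title, optional subtitle, year and filetype. The function names are hypothetical, and collapsing every run of non-alphanumeric characters into a single hyphen is an assumption about how stray punctuation should be handled.

```python
import re
from typing import List, Optional


def slug(text: str) -> str:
    """Keep ASCII letters and digits; turn each run of anything else into one hyphen."""
    return re.sub(r"[^A-Za-z0-9]+", "-", text).strip("-")


def docname(authors: List[str], title: str, year: int,
            filetype: str, subtitle: Optional[str] = None) -> str:
    """Build Author-One--Author-Two_Title--Subtitle_YYYY.filetype."""
    author_part = "--".join(slug(a) for a in authors)  # '--' joins multiple authors
    title_part = slug(title)
    if subtitle:
        title_part += "--" + slug(subtitle)            # '--' marks the subtitle
    return f"{author_part}_{title_part}_{year}.{filetype}"


print(docname(["Author One", "Author Two"], "Title", 2020, "pdf", subtitle="Subtitle"))
```

    Note that this is lossy by design: accented and other non-ASCII characters are simply dropped, which matches the "useful rather than precise" goal but would need transliteration to be kinder to such titles.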

    The idea isn't to fully and completely or precisely represent all aspects of a document or work, but to usefully do so. So yes, that means that foreign character sets aren't represented, that full author lists aren't included (for scientific papers these can number in the tens to hundreds), etc. But enough to find the work reasonably within a corpus through a directory listing.

    Yeah, I'm familiar with Calibre, Zotero etc., and should really get more familiar with them. But they're clunky enough and not sufficiently universally available (e.g., on Android, where most of my documents live these days, via an e-book reader) that I'm not optimistic they're really a solution.

    (Hoisted from a limited share.)

    #DocumentManagement #Whitespace #OnTheNamingOfCats #OnTheNamingOfFiles #Whatever #SameThing #RockyHorror #MacavitysNotHere #Bombalurina #Effanineffable #OldPossum #TSEliot #DOS #PaulOtlet #Mundaneum

  9. @alcinnz So, effectively a filetype:application association manager. file(1) and magic(5) on steroids.

    I am thinking of managing metadata associated with documents, works (multiple forms / manifestations of a single document), projects and workflows (involving various records, etc), and the overall document lifecycle: creation, acquisition, cataloguing, use, adaptation, distribution, destruction.

    That's what I've lumped under my #webfs and #docfs concepts, along with #kfc (Krell Functional/Fucking Context).

  10. @billjanssen Thanks again. Some of that looks ... closer. Cone Tree and Perspective Wall most so, though still not quite there.

    Are you associated with this research/development, or just an interested party?

    One thing I've thought about considerably as I'm increasingly using e-book readers and being frustrated by their own document management / organisational limitations, is how physical library space maps, with multiple dimensional convolutions, to stored data:

    There's a mix of physical and logical organisations:

    character -> word -> line -> page -> signature -> book

    character -> word -> sentence -> paragraph -> chapter -> book

    Shelf -> bookcase -> aisle -> floor -> building

    A book (nominally: 250 pages) is about 125k words.

    About 32 books fit to a shelf, 8 shelves to a bookcase, say, 16 bookcases to an aisle, 16 aisles to a floor. (I'm biasing to powers-of-two numbers here)

    That's 256 books per case, 4,096 per aisle, 65,536 per floor.

    (A fairly large community library is on the order of 300k books, or about 4 floors as I've defined them. A large university library, 122 such floors. Based on my experience, I may be underspecifying density, and would be interested in actual data.)

    And so on.
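
    The arithmetic above, written out as a short sketch. The constants are the powers-of-two figures from this post; the variable and function names are just labels chosen here.

```python
BOOKS_PER_SHELF = 32
SHELVES_PER_CASE = 8
CASES_PER_AISLE = 16
AISLES_PER_FLOOR = 16
WORDS_PER_BOOK = 125_000  # nominal 250-page book

books_per_case = BOOKS_PER_SHELF * SHELVES_PER_CASE    # 256
books_per_aisle = books_per_case * CASES_PER_AISLE     # 4,096
books_per_floor = books_per_aisle * AISLES_PER_FLOOR   # 65,536


def floors_needed(books: int) -> float:
    """How many such floors a collection of the given size would occupy."""
    return books / books_per_floor


print(books_per_case, books_per_aisle, books_per_floor)
print(round(floors_needed(300_000), 1))   # 4.6 floors for a 300k-book library
print(122 * books_per_floor)              # 7,995,392 books on 122 floors
print(books_per_floor * WORDS_PER_BOOK)   # 8,192,000,000 words per floor
```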

    The point I'm trying to make though isn't about density but of navigation of that space. The reader/researcher can go to a specific book, or to a shelf (closely related works), an aisle, a floor, etc. There's a different level of aggregation at each point in the scale, and for topically-organised (e.g., Library of Congress classification or Dewey Decimal), a specific region corresponds largely with a specific subject grouping.

    On my e-book reader, I'm effectively limited to only one level of aggregation: a sequential shelf scan of books. With storage exceeding several TB, and an average book size of ~1--5 MB, that's effectively a fairly large community library worth of potential documents which can be carried in one's hand or satchel, but for which the organisational capabilities are ... exceedingly limited.

    This remains a major frustration of mine.

    #KFC #DocFS #WebFS #Libraries #DocumentManagement

  11. @Researchbuzz The proximity element is limited as I am, of course, on Altair IV, some 20 of your light years away.

    That said, one of my obsessions (though not necessarily a major element of my Mastodon tooting) is information, knowledge, and document management.

    The tags #kfc, #webfs, and #docfs will lead to a few of my information-management / search toots / threads.

    And if you've got opinions, feelings, and/or deep intel on #PaulOtlet and his #Mundaneum I'm all ears.

    @woozle

  12. @jonny My principles here are:

    • The filename should be descriptive and not simply unique.
    • It should be human-meaningful in some manner if at all possible.
    • It should scope to the collection size / namespace.

    Estimates I'm aware of are that there are on the order of 100--200m books ever published, growing at ~1m/year, and a generally comparable set of scientific articles. News organisations such as Reuters, AP, and AFP produce about 1k--5k items daily, and I suspect many of those are photos or videos. Major newspapers tend to produce about 100--500 stories daily (weekday vs. weekend). You can work out ballpark maths from that.
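
    One way to work out that ballpark, as a small sketch. The daily figures are the estimates quoted above; the helper names and the choice of a 365-day year are assumptions.

```python
from typing import Tuple

DAYS_PER_YEAR = 365


def per_year(daily_low: int, daily_high: int) -> Tuple[int, int]:
    """Scale a daily range of items up to a yearly range."""
    return daily_low * DAYS_PER_YEAR, daily_high * DAYS_PER_YEAR


def magnitude(n: int) -> int:
    """Order of magnitude (power of ten) of a positive integer."""
    return len(str(n)) - 1


wire_service = per_year(1_000, 5_000)   # (365000, 1825000) items per year
newspaper = per_year(100, 500)          # (36500, 182500) stories per year

print(wire_service, newspaper)
print(magnitude(200_000_000))           # 8: books ever published, upper estimate
print(magnitude(newspaper[0]))          # 4: one newspaper, one year, lower estimate
```

    So a name that only has to be findable within one newspaper's year works in a space of roughly 10^4 to 10^5 items, while "all books ever" is around 10^8.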

    For correspondence, the originator and recipient ("From:" and "To:") are both significant. Those might be referenced. Publishing, to a general audience, is in a sense correspondence where "From:" == Author and "To:" == World.

    The filename need not be precise, exact, or an accurate presentation of contents, but USEFUL. That is, within a corpus, can I find a specific work or works of interest? In this sense, the titling scheme is an example of the principle I've developed that search is identity, in the sense that a search might produce 0, 1, or n>1 results. 0 is null, 1 is identity, and > 1 is a result set.

    There are other naming and cataloguing schemes. A complete system would have correspondences between these and the conventional / human-readable titles, e.g., ISBN, LOCCS, OCLC, DOI, etc.

    And yes there are other cataloguing systems such as SuDoc (used by the US government) which are useful in their own contexts.

    Author, date, content, audience, and publisher are generally useful search-space-reducing concepts that apply across a fairly wide range of contexts. E.g., if I were including, say, store receipts or purchase orders, the vendor, customer, date, location, and a summary of contents (say, largest item) would serve as a description. Computer logs tend to be time and process/service oriented, perhaps also mentioning user or network address, etc.

    Related hashtags and discussion:

    #docfs #webfs #KFC #PaulOtlet #Maundenaum

  13. @Valenoern This is the essential idea behind "docfs", which would be a document-oriented filesystem. Its networked sibling being "webfs".

    "Document" here is in the sense of #PaulOtlet, of any durable record. That might be a text, image, sound, video, multimedia content, data, software, or an amalgamation or melange.

    One of my key ideas is that the metadata for these documents would be part of the filesystem, extending the notion of what constitutes file-centric data. I'd like to see some form of bibliographic data presented, where available for public and published media (book, articles, audio recordings, films).

    Search is another element, and one idea for the filesystem would be as a virtual filesystem in which attributes could be supplied until a single item matching those criteria was found. "Identity is search".

    For projects, some concept of structured workflows, with groups, tasks, milestones, and contributing data. For a sufficiently structured organisation, security and access controls.

    I'd like the whole concept to be as commercialisation-hostile as possible, with both copyrights and payments entirely out of scope.

    #docfs #webfs #kfc #maundenaum #DublinCore #metadata #bibliography #Plan9OS #Schopenhauer

  14. @CyberpunkLibrarian I'd very much like that.

    I've been half-assedly kicking around an idea to build such a thing, generally referred to as #KFC (Krell Functional Context / Krell Fucking Context, variously). See also #WebFS and #DocFS which relate: accessing the Web as a filesystem (see Plan9OS) and a documents-oriented filesystem in which "paths" are actually "search queries" through various spaces (author, title, pubdates, subjects / keywords, publishers, identifiers ISBN/OCLC/LOCCN/DOI, etc).

    The results of any path specification are strictly one of:

    • No results (a failed search).
    • One result (an identity search, at least at the time performed).
    • Multiple results (a set). Which might be variously small or large (I'm thinking of some vaguely logarithmic scale for classifying this.)
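
    The three outcomes above might be labelled like this. It is a sketch only: the label strings are made up here, and using the power of ten as the "vaguely logarithmic" size class is one possible reading of that idea.

```python
def classify(count: int) -> str:
    """Label a result count as a failed search, an identity, or a sized set."""
    if count < 0:
        raise ValueError("result count cannot be negative")
    if count == 0:
        return "none"
    if count == 1:
        return "identity"
    # For positive integers, len(str(n)) - 1 equals floor(log10(n)).
    power_of_ten = len(str(count)) - 1
    return f"set (~10^{power_of_ten})"


for n in (0, 1, 7, 4096):
    print(n, classify(n))
```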

    I'd also like to see workflow included: some sense of a cataloguing workflow (desired, acquired, classified, converted (to some minimally-sufficient-complexity best format, which is to say, LaTeX 😺)), privacy scopes and controls, and relations between works (citations, references, translations, authors, concepts, projects, ...)

    Mind, this is all but entirely vapourware.

    @FiXato

  15. @thornAvery My own approaches are:

    • Find LITERALLY ANY FORMAT OTHER THAN PDF. HTML, text, ePub, etc., if possible.

    • Try pdftotext, part of Poppler utils: poppler.freedesktop.org/ This is available for most Linux distros, MacOS under Homebrew, or check out via Git.

    If I can get something vaguely reasonable, that's usually sufficient.

    • OCR is an option. I've never had good luck with that, and there's such a tremendous amount of tedious correcting that retyping is frequently preferable. That said, I operate at fairly low scale.

    • Retype by hand. Since I'm usually reading the work, this actually turns out to be a pretty good reading method for content-retention.

    PDF itself is a container around a bunch of other formats. Asking how to convert a PDF is a bit like asking how to cook a bag full of groceries. It really depends on what's in it, and what you're hoping to get.
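
    For the pdftotext route, a minimal wrapper sketch, assuming Poppler's pdftotext is installed and on the PATH. The function names are hypothetical. The "-" output argument (write to stdout) and the -layout option are standard pdftotext usage, but whether the output is any good still depends entirely on what is inside the PDF.

```python
import shutil
import subprocess
from typing import List


def pdftotext_cmd(pdf_path: str, keep_layout: bool = False) -> List[str]:
    """Build the argument list; '-' asks pdftotext to write the text to stdout."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # try to preserve the physical page layout
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path: str, keep_layout: bool = False) -> str:
    """Run pdftotext and return whatever text it manages to extract."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install Poppler utils first")
    result = subprocess.run(
        pdftotext_cmd(pdf_path, keep_layout),
        capture_output=True, encoding="utf-8", check=True,
    )
    return result.stdout
```

    An empty or near-empty result usually means the PDF holds page images rather than text, at which point it's back to OCR or retyping.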

    #PDF #PDFConversion #kfc #docfs #webfs

  16. @thornAvery I'm trying to find what I thought I remembered as an excellent HN comment discussing how to do this at scale.

    It turns out to be really complicated.

    That said, maybe tell us what it is you're trying to do, specifically:

    • How many documents.
    • How large.
    • What languages / charactersets.
    • What budget (if any).
    • What end-use.

    #webfs #docfs #kfc #PDFConversion #pdf

  17. @thornAvery There's no such creature that will cover all cases. You may get lucky in many instances with easier options.

    Your best bet is to find another form of the document that's closer to text. For many published documents there are good odds of this.

    If the PDF is actually rendered from a text source, pdftotext is pretty good at extracting the actual text.

    If it's not ... you're left with a much more challenging job. I find with rather startling frequency that simply re-typing the document from scratch is often the best option.

    #pdf #PDFConversion #kfc #docfs #webfs

    1/

  18. The US Federal Government probably produces more documents than any other entity on Earth.

    Adelaide Hasse (1868--1953) is the public-schooled, self-taught OG BAMF who created the indexing and classification system which still organises that to this day, the Superintendent of Documents Classification System (SuDoc).

    en.wikipedia.org/wiki/Adelaide

    #AdelaideHasse #SuDoc #LibraryClassification #DocumentManagement #kfc #docfs #webfs #libraries

  20. @mdhughes The metadata problem is one that I've been working at (very slowly) for some years now.

    Go through my #KFC #docfs and #webfs tags for some context (it's all very loose). But a goal would be to extend filesystem metadata probably along the lines of Dublin Core Metadata, as well as Some Other Stuff.

    It's ... complicated.

    I don't see the application being one that's suited filesystem-wide (though I could be wrong on that). A documents-archive-specific filesystem (or extension / overlay) would be very much relevant.

    @wftl

  21. @brennen I've been noodling at a concept under the rubric of #webfs / #docfs / #kfc for a few years, which is ultimately strongly influenced by Paul Otlet, Ted Nelson, Vannevar Bush, etc. Which is to say, far smarter people than me have tried and failed. But where would you be if you didn't try, as Lyle Lovett said.
    invidious.snopyta.org/watch?v=

    I think any system is going to have to have a review-and-cull stage, and you'll have to schedule (time-budget) for those steps. This includes the calendrical bookmarks model.

    The heart of "KFC" (Krell Functional Context) is the idea that "identity" for a document is a search function, and if you can define a search that returns a document, you've identified it, at least from a strictly pragmatic view.

    Metadata --- title, author, date, subject, references, citations, tags, keyword search --- are all search indicia. And the key for your system is to provide a useful way of returning to some prior work.

    Discovery / use context is also a search criterion. That's what tree-style tabs offers, though it's a fragile and brittle association that cannot be saved by any mainstream browser of which I'm aware.

    2/

  22. @brennen Short answer: not that I've found, yet, to my satisfaction.

    Some things that don't work IME:

    • Google Chrome, especially on Android. Anything past ~5 tabs is utterly unmanageable. My "close tabs" post-it note is dated April. Of 2015.

    • Pocket. See old.reddit.com/r/dredmorbius/c

    • Tree-Style Tabs. A good start, but ultimately not the solution.

    • Zotero. I just don't think like it does. Friction is too high.

    • Save-as-PDF. You now have an unorganised pile elsewhere.

    • Zettelkasten / index cards. Useful, but too high-friction for online stuff.

    1/

    #TabManagement #InformationManagement #DocumentManagement #kfc #docfs #webfs

  23. On Paperwork vs. Digital Formats

    tired: Our customer's paperwork is profit. Our own paperwork is loss.[1]

    wired: Your proprietary data format is loss. Our proprietary data format is profit.

    I'd remembered the first aphorism from a long-ago collection of Murphy's Laws.

    Thinking through my struggles at organising online and digital media, references, etc., I realised that a huge problem is that these formats don't serve my goals. They're designed far more around their authors' goals, or even more often, the publishers' goals, largely around advertising, marketing, tracking, building lock-in, creating and defending monopolies, and the like.

    Digital formats that are in the end-user's interest and specification serve the user. Those that are in the publisher's specification serve the publisher.

    A related thought is that a key affordance of printed periodicals (newspapers, magazines, journals) is that of garbage collection, to put a contemporary spin on it.

    When you're done reading a newspaper or magazine, you pick up the whole lot and throw it out. There's an intermediate level of organisation other than "the article" and "the whole collection" (that is, everything published in your office or home): "the issue". (Or perhaps a box or shelf of archived media.) That is, _there are multiple naturally-occurring levels of aggregation._

    When you're trying to sort through a set of browser tabs, you generally have only two levels of aggregation: the individual tab, or the entire session. There are typically no intermediate levels, and sorting through what you want to keep (or re-read, or work with) means you've got to go through the set one at a time and resolve disposition. The data format serves the browser vendor, but not the user.

    Tools such as Tree-Style Tabs, an absolutely essential Firefox extension, give a higher level of natural organisation, the tab tree. Here, a structure emerges, without user effort, of related content. At the top of the tree is whatever page began an exploration, and as you descend it, you go further down into the search. When cleaning up, it's possible to pick any given tab, branch, or whole tree, and close it out in one fell swoop. Garbage collection costs are reduced.
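
    The tab tree and its cheap clean-up can be modelled in a few lines. This is an illustrative data structure only, not how the Tree-Style Tabs extension is implemented; the class and method names are invented here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Tab:
    title: str
    children: List["Tab"] = field(default_factory=list)

    def open_child(self, title: str) -> "Tab":
        """Open a link from this tab; the new tab becomes its child."""
        child = Tab(title)
        self.children.append(child)
        return child

    def count(self) -> int:
        """This tab plus everything beneath it."""
        return 1 + sum(c.count() for c in self.children)

    def close_branch(self, title: str) -> int:
        """Close the first descendant with this title and its whole subtree.

        Returns the number of tabs closed (0 if no such tab was found).
        """
        for i, child in enumerate(self.children):
            if child.title == title:
                closed = child.count()
                del self.children[i]
                return closed
            closed = child.close_branch(title)
            if closed:
                return closed
        return 0


root = Tab("search results")
a = root.open_child("article A")
a.open_child("A ref 1")
a.open_child("A ref 2")
root.open_child("article B")
print(root.close_branch("article A"), root.count())
```

    One call disposes of a whole line of exploration, which is the intermediate level of aggregation a flat tab bar lacks.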

    (Three guesses as to what I've been attempting to do, and the first two don't count.)

    #media #paperwork #DigitalMedia #DigitalFormats #FileFormats #DataFormats #kfc #docfs #UserCentricDesign #TreeStyleTabs

  24. #DearMastomind: What different types / uses of tables can you think of?

    I'm looking with a mind to document / Web formatting and styles.

    Off the top of my head:

    • Short lists (such as this one), which are effectively single-column tables usually with far too much text crammed into a single line (such as this one).
    • Data tables.
    • Textual tables --- generally with few or no quantitative cells.
    • Simple (or not-so-simple) spreadsheets, offering sort, totals, and/or other summary statistics, possibly subsetting or cross-tabulation capabilities, in an interactive or at least intelligently-computed sense.
    • Graph-adjacent tables. Data tables which are directly related, possibly interactively linked, to some data visualisation(s).
    • Tabular layout. Tables used principally to organise and arrange longer bits of textual content. Need not be a classic HTML table layout or grid, though approaches this.

    Different uses might have different formatting, including borders, "greenbar" separators, interactive sort or filtering capabilities (typically created now with Javascript, though native browser support might be handy), etc.

    If you can think of a good discussion or reference addressing this question that would also be helpful.

    #layout #tables #html #css #latex #docfs #webfs #kfc #browsers

  25. @cadadr Very much this.

    Zettelkasten (or any effective notetaking system) is not about structuring your content so much as it is about enabling structure discovery.

    As such, the absolute core functionality is enabling structure discovery. If it fails to serve this, it's broken. Frictions impeding this (or its precursors) must be reduced.

    Incidentally, most of my criticisms of #Pocket boil down to the fact it hugely impedes structure discovery.

    @temporal

    #kfc #docfs #webfs