home.social

#archivebox — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #archivebox, aggregated by home.social.

  1. How to Install and Run #ArchiveBox on #Ubuntu #VPS Server in 5 Minutes (Quick Start Guide)

    This article is a guide to installing and running ArchiveBox on an Ubuntu VPS server.
    What is ArchiveBox?
    ArchiveBox is a powerful, self-hosted internet archiving solution to collect, save, and view websites offline. Without active preservation effort, everything on the internet eventually ...
    Continued 👉 blog.radwebhosting.com/install #vpsguide #opensource #selfhosting #installguide #selfhosted

  2. I've mirrored a relatively simple website (redsails.org; it's mostly text, some images) for posterity via #wget. However, I also wanted to grab snapshots of any outlinks (of which there are many, as citations/references). By default, I couldn't figure out a configuration where wget would do that out of the box, without endlessly, recursively spidering the whole internet. I ended up making a kind-of poor man's #ArchiveBox instead:

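    # Read one URL per line; quoting keeps URLs with special characters intact.
    while IFS= read -r url ; do
      [ -n "$url" ] || continue   # skip blank lines
      # Name each snapshot directory after the SHA-256 hash of its URL.
      dirname=$(echo "$url" | sha256sum | cut -d' ' -f 1)
      mkdir -p "$dirname"
      wget --span-hosts --page-requisites --convert-links --backup-converted \
        --adjust-extension --tries=5 --warc-file="$dirname/$dirname" \
        --execute robots=off --wait 1 --waitretry 5 --timeout 60 \
        -o "$dirname/wget-$dirname.log" --directory-prefix="$dirname/" "$url"
    done < others.txt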

    Basically, there's a list of bookmarks^W URLs in others.txt that I grabbed from the initial mirror of the website with some #grep foo. I want to do as good of a mirror/snapshot of each specific URL as I can, without spidering/mirroring endlessly all over. So, I hash the URL, and kick off a specific wget job for it that will span hosts, but only for the purposes of making the specific URL as usable locally/offline as possible. I know from experience that this isn't perfect. But... it'll be good enough for my purposes. I'm also stashing a WARC file. Probably a bit overkill, but I figure it might be nice to have.
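    The post does not show how others.txt was built. One way the grep step could look is sketched below; the helper name extract_outlinks, the URL regex, and the mirror directory name are assumptions for illustration, not the author's actual commands. It relies on the -r, -h, and -o options found in GNU and BSD grep.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): print the unique
# external URLs found in a local mirror, one per line.
#   $1 = mirror directory
#   $2 = hostname of the mirrored site, whose own links are excluded
# Note: dots in $2 are treated as regex wildcards, which is good enough here.
extract_outlinks() {
  grep -rhoE "https?://[^\"'<>[:space:])]+" "$1" \
    | grep -vE "^https?://([^/]*\.)?$2([/:?#]|\$)" \
    | sort -u
}
```

    Assuming the mirror lives in ./redsails.org, running extract_outlinks ./redsails.org redsails.org > others.txt would produce the list that the wget loop consumes. The regex is deliberately loose, so the result probably still needs a manual pass.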

    #RedSails #archive #archival #archiving #warc

  3. Wow! #TIL about #ArchiveBox, your #selfhosted #alternativeTo @internetarchive!

    Runs on #Python (OS-packaged or #docker-ed) and saves either single pages or whole-website crawls in every format you could wish for:

    ✅ self-contained single-page HTML
    ✅ PDF
    ✅ PNG screenshot
    ✅ plaintext
    ✅ DOM-dump
    ✅ priv./publ. #archive
    ✅ media audio/video included (+yt-dlp)
    ✅ #WARC compat.

    🌐 archivebox.io
    📜 github.com/ArchiveBox/ArchiveB
    demo.archivebox.io

    #WebArchiving #WebCrawling #DigitalPreservation

  4. @EpicCyndaquil For managing multiple tabs, I used #tabCloud years ago: chrometabcloud.appspot.com/ It worked really well.

    I'm not sure whether it still works, and it may be obsolete, but it let you back up and restore tab sessions. It might be a good starting point if you're looking for an alternative.

    I rely on @brave's tab sync feature right now, which has worked well so far.

    Thanks for the #archivebox and #moreso options. I'll check them as well!

  5. Same story when trying to get #archivebox to run smoothly. An absolute shitshow. I've now ended up with the #Singlefile Firefox extension. The output lands in #Obsidian and, thanks to the Read-HTML plugin, is readable there. I should have thought of that much sooner.