#stormcrawler — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #stormcrawler, aggregated by home.social.
-
What are your favorite / the best #WebCrawlers for broad / #WebScale #crawling?
I've built a list but am looking for anything I missed: https://github.com/davidshq/awesome-search-engines/blob/main/WebCrawlers.md
Main options I've found include #Apache #Nutch, #StormCrawler, #Scrapy, #Norconex, #PulsarR, #Heritrix, and #sparkler.
-
@elan Also see #Heritrix and, of course, #StormCrawler as alternatives to #ApacheNutch.
-
Call to all #StormCrawler users: we will release a new version shortly so that people can benefit from the latest additions (#Opensearch) and improvements (#WARC). Any chance you could test some crawls with the latest code in the main branch and report any issues? Thanks.
-
A very nice contribution to #StormCrawler improving the generation of #WARC files