#heritrix — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #heritrix, aggregated by home.social.
-
It looks like the Biblioteca Nacional de España has a project to create an archive of the Spanish web.
https://www.bne.es/es/colecciones/archivo-web-espanola/aviso-webmasters
There they say they use #Heritrix, created by the Internet Archive. Searching for more information turns up http://crawler.archive.org/ (yes, a page served over http rather than https), and at the top it says "Obsolete" in large type:
For latest information see https://webarchive.jira.com/wiki/display/Heritrix
And when you click through, they force you to log in.
What crap documentation if they make you log in
-
What are your favorite / the best #WebCrawlers for broad / #WebScale #crawling?
I've built a list but am looking for anything I missed: https://github.com/davidshq/awesome-search-engines/blob/main/WebCrawlers.md
Main options I've found include #Apache #Nutch, #StormCrawler, #Scrapy, #Norconex, #PulsarR, #Heritrix, and #sparkler
-
@elan also see #Heritrix and of course #StormCrawler as alternatives to #ApacheNutch