home.social

#advertools - Public Fediverse posts

Live and recent posts from across the Fediverse tagged #advertools, aggregated by home.social.

  1. Using a proxy while crawling

    This is another feature of using the meta parameter while crawling with #advertools.

    It's as simple as providing a proxy URL.

    There's also a link about using rotating proxies, if you're interested.

    bit.ly/3SXh8b8

    #crawling #scraping #scrapy #proxy
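    A minimal sketch of the idea above. The proxy URL and output file name are placeholders, and the crawl call itself is commented out since it needs network access (and a real proxy):

```python
# Hypothetical proxy endpoint -- substitute your provider's URL
crawl_meta = {"proxy": "http://user:pass@proxyhost:8080"}

# Passing it through the meta parameter routes requests through the proxy:
# import advertools as adv
# adv.crawl("https://example.com", "example_crawl.jl", meta=crawl_meta)
```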

  2. Happy to share a new release of #advertools v0.16

    This release adds a new parameter "meta" to the crawl function.

    Options to use it:

    🔵 Set arbitrary metadata about the crawl
    🔵 Set custom request headers per URL
    🔵 Limited support for crawling some JavaScript websites

    Details and example code:

    bit.ly/3SXh8b8

    #SEO #crawling #scraping #python #DataScience #advertools #scrapy
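    A sketch of the first option (arbitrary metadata about the crawl). The key names here are made up purely for illustration, and the crawl call is commented out since it needs network access:

```python
# Arbitrary metadata to record alongside the crawl -- keys are hypothetical
crawl_meta = {
    "crawl_purpose": "quarterly-audit",
    "crawled_by": "data-team",
}

# import advertools as adv
# adv.crawl("https://example.com", "audit_crawl.jl", meta=crawl_meta)
```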

  3. Day 75 of #100DaysOfCode

    User-agent parser app refactor and update

    🔵 Upload a list of user-agent strings
    🔵 Get them split into their components (OS, family, device, brand, version...)
    🔵 Download the parsed UAs to a CSV file
    🔵 Interactively visualize the UAs on multiple levels using any of the components

    bit.ly/3EEPRkv

    #advertools #SEO #DataVisualization #DataScience #logfiles

  4. Day 15 of #100DaysOfCode:

    Created a bunch of custom #advertools crawlers, with one line of code each.

    You can set your own defaults (e.g. follow_links=True by default?)

    Examples:

    🔵 Exploratory crawler: spider mode on. Stop after 2k URLs.
    🔵 Rude crawler: don't obey robots rules.
    🔵 Polite crawler: obey robots (default), crawl very slowly, with long pauses between crawled pages.

    #DataScience #crawling #scraping #scrapy #SEO #Python #data

    1/2

  5. Day 14 of #100DaysOfCode:

    Created a tutorial on analyzing millions of URLs:

    🔵 2.4M URLs from a web server log file
    🔵 Splitting them into their components creates a 5.7GB (giga) DataFrame
    🔵 Using the new output_file parameter saves the same data in a 67MB (mega) file
    🔵 Read only the columns you want, while filtering for a subset of rows
    🔵 Enjoy!

    Notebook and video:

    bit.ly/49socSd

    #DataScience #Python #logfile #URL #SEO #advertools

  6. Day 8 of #100DaysOfCode:

    Added the option to specify custom date formats for log files:

    🔵 advertools.logs_to_df will attempt to convert datetime columns to a datetime type according to default formats
    🔵 Supply your own date format if your logs use a different one (or if you decide to change it)
    🔵 The date format follows the strftime format spec
    🔵 Coming in adv v0.15.0

    #advertools #logfile #DataScience #SEO #Python
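    As a toy check of what such a strftime pattern looks like (the pattern below matches the timestamp style common in Apache/nginx logs; substitute whatever your own logs use):

```python
import pandas as pd

# strftime pattern for timestamps like 01/Dec/2023:10:30:00 +0000 --
# this is the kind of string you'd supply as a custom date format
date_format = "%d/%b/%Y:%H:%M:%S %z"

parsed = pd.to_datetime("01/Dec/2023:10:30:00 +0000", format=date_format)
```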

  7. Day 1 of #100DaysOfCode:

    Added the ability to supply request headers while fetching sitemaps with #advertools
    (available in the next release)

    This can help in changing the User-Agent, for example. It can also be used to skip sitemaps that haven't changed, using the If-None-Match header: keep a fresh set of sitemaps, check continuously, and only download the updated ones.

    You can use any other header of course.

    #DataScience #SEO #XML #Sitemap #Python
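    A sketch of such headers. The User-Agent string and ETag value are made up, and the fetch itself is commented out since it needs network access:

```python
# Custom User-Agent plus conditional fetching: with If-None-Match, the server
# returns the sitemap only if its ETag no longer matches, i.e. only if it changed
headers = {
    "User-Agent": "my-sitemap-bot/0.1",         # hypothetical UA
    "If-None-Match": '"33a64df551425fcc55e4"',  # ETag from a previous fetch
}

# import advertools as adv
# adv.sitemap_to_df("https://example.com/sitemap.xml", request_headers=headers)
```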

  8. With the crawlytics links function, finding internal broken links is much easier than finding external ones.

    After crawling a website:

    🔵 Get the link summary table with crawlytics[.]links
    🔵 Filter the error pages from the crawl table by any status code rule you want, e.g. >= 400, != 200, etc.
    🔵 Merge the two tables
    🔵 Done

    Here is a notebook if you want to test it out:

    bit.ly/43dKhCB

    #advertools #DataScience #SEO #Python #DigitalAnalytics #DigitalMarketing
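    The filter-and-merge steps above can be sketched with toy tables in plain pandas. The data and column names are made up for illustration (source page, link target, anchor text, and per-URL status codes):

```python
import pandas as pd

# Toy stand-in for the link summary table: source page, link target, anchor text
link_df = pd.DataFrame({
    "url": ["/home", "/about", "/home"],
    "link": ["/about", "/missing", "/contact"],
    "text": ["About us", "Old page", "Contact"],
})

# Toy stand-in for the crawl table: one row per crawled URL with its status code
crawl_df = pd.DataFrame({
    "url": ["/home", "/about", "/missing", "/contact"],
    "status": [200, 200, 404, 200],
})

# 1. Filter the error pages by any rule you want (here: status >= 400)
errors = crawl_df[crawl_df["status"] >= 400]

# 2. Merge: which pages link to those error URLs, and with what anchor text?
broken = link_df.merge(
    errors, left_on="link", right_on="url", suffixes=("_source", "_target")
)
```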

  9. External link analysis with the #advertools crawlytics module

    🔵 Use the links() function to map all links on a website (URL, anchor text, nofollow, internal/external)
    🔵 Count the most linked-to domains
    🔵 Crawl the external links and get their status codes
    🔵 Locate broken external links on the website using their location and anchor text
    🔵 Enjoy

    Get a copy of the HTML report (includes link to code repo):
    bit.ly/48OowL5

    #DataScience #SEO #Crawling #Python #DigitalAnalytics #DigitalMarketing
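    The domain-counting step can be sketched with the standard library and pandas on a few made-up external link targets (in practice these would come from the link mapping, filtered to external links only):

```python
import pandas as pd
from urllib.parse import urlsplit

# Made-up external link targets
external_links = pd.Series([
    "https://example.org/page",
    "https://example.org/other",
    "https://partner.net/",
])

# Extract each link's domain and count the most linked-to ones
domain_counts = external_links.map(lambda u: urlsplit(u).netloc).value_counts()
```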

  10. Data Science with Python for SEO Course: This Monday!
    Get the full details and join here:

    bit.ly/dsseo-course

    If you have any questions let me know, and if you think others might benefit, please let them know.

    #DataScience #SEO #Python #DigitalMarketing #DigitalAnalytics #advertools #pandas #plotly #DataVisualization

  11. Internal links: How interlinked are the different sections of a website?

    🔵 Using adv[.]crawlytics[.]links we get a mapping of all links (source -> destination)
    🔵 Using adv[.]url_to_df we get each component of those links (scheme, domain, path, etc.)
    🔵 Count the combinations of the first directories to get the number of links from/to each section of the website

    What do you think?

    #DataScience #digitalanalytics #Python #advertools #SEO
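    The counting step can be sketched in plain pandas with a made-up source -> destination mapping. A simple string split stands in here for extracting the first path directory (which url_to_df exposes as a column):

```python
import pandas as pd

# Made-up source -> destination link mapping
links = pd.DataFrame({
    "url":  ["https://site.com/blog/a", "https://site.com/blog/b", "https://site.com/shop/x"],
    "link": ["https://site.com/shop/x", "https://site.com/blog/c", "https://site.com/blog/a"],
})

def first_dir(url):
    """First path directory, e.g. 'blog' for https://site.com/blog/a."""
    return url.split("/")[3]

# Count links from each section of the site to each section
section_links = links.assign(
    from_section=links["url"].map(first_dir),
    to_section=links["link"].map(first_dir),
).value_counts(["from_section", "to_section"])
```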

  12. #GSC analysis report template - 1st version

    Discussed in #advertools office hours tomorrow

    Here's a copy of the current report. Would love any recommendations, issues, suggestions...

    bit.ly/48BApVI

    Report created using @Posit's Quarto

    #GoogleSearchConsole #DigitalAnalytics #SEO #DataScience #Python #advertools

  13. The split of topics that The New York Times covered in 2022.

    Interactive HTML chart & code:
    bit.ly/3zSxbNh

    You can check other years and see how/if their publishing has changed.

    I removed the dates from the URLs in this case (YYYY/MM/DD) to get a better overview. Note that you can include links* in the chart:

    🔵 more than one link
    🔵 links using a URL shortener like bit[.]ly
    🔵 links containing UTM codes

    #DataScience #DataVisualization #Python #treemap #advertools #adviz #SEO
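    Stripping the /YYYY/MM/DD/ segment can be sketched with a regex; the URLs below are made up in the NYT pattern:

```python
import re

urls = [
    "https://www.nytimes.com/2022/01/15/technology/some-article.html",
    "https://www.nytimes.com/2022/03/02/arts/another-piece.html",
]

# Strip the /YYYY/MM/DD/ segment so URLs group by topic directory, not by date
no_dates = [re.sub(r"/\d{4}/\d{2}/\d{2}/", "/", u) for u in urls]
```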