home.social

#hackernewsanalytics — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #hackernewsanalytics, aggregated by home.social.

  1. More on "UNCLASSIFIED": there are 36,520 of those sites right now. (Despite knowing better, I keep diving in and classifying more of them.)

    It's not practical to list all of them. But we can randomly sample. And large-sample statistics start to apply at about n=30, so let's just grab 30 of those sites at random using sort -R | head -30:

    posts  site
        1  sfg.io
        1  extroverteddeveloper.com
        2  letmego.com
        1  thestrad.com
        2  bombmagazine.org
        1  domlaut.com
        1  bootstrap.io
        1  jumpdriveair.com
        2  desmos.com
        1  leo32345.com
        1  echopen.org
        1  schd.ws
        1  web3us.com
        7  akkartik.name
        1  bcardarella.com
        1  cancerletter.com
        1  platinumgames.com
        1  industrytap.com
        2  worldoftea.org
        1  motion.ai
        1  vectorly.io
        2  enterprise.google.com
        1  lift-heavy.com
        1  davidpeter.me
        1  panoye.com
        3  thestrategybridge.org
        2  fontsquirrel.com
        1  kettunen.io
        1  moogfoundation.org
        2  elekslabs.com

    That's a few foundations, a few blogs, a corporate site (enterprise.google.com), and something about tea, all with a small number of posts (1--7).

    I'm looking at some slightly larger samples (60--100) here on my own system, and can compare across samples to see how much variance there is. That helps tune what I would expect to find among the "UNCLASSIFIED" sites.
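A minimal sketch of that sample-and-compare step, in Python rather than the shell pipeline used in the post. The `population` mapping is a made-up stand-in for the real site list, and the function names are hypothetical:

```python
import random
import statistics

def sample_sites(site_posts, n=30, seed=None):
    """Draw n sites uniformly at random without replacement (like `sort -R | head -n`)."""
    rng = random.Random(seed)
    names = rng.sample(sorted(site_posts), n)
    return {name: site_posts[name] for name in names}

def compare_samples(site_posts, n=30, repeats=5, seed=0):
    """Return the mean posts per site of several samples, plus the spread of those means."""
    means = []
    for i in range(repeats):
        sample = sample_sites(site_posts, n, seed=seed + i)
        means.append(statistics.mean(sample.values()))
    return means, statistics.stdev(means)

# Hypothetical population: mostly one-post sites, with a few busier ones.
population = {
    f"site{i}.example": 1 + (i % 7 == 0) + 5 * (i % 50 == 0)
    for i in range(1000)
}

means, spread = compare_samples(population, n=30, repeats=5)
print("sample means:", means)
print("std dev of sample means:", spread)
```

A small spread between sample means suggests one sample of that size already describes the population reasonably well. A large spread argues for bigger samples.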

    Which is one way of using #StatisticalMethods to make estimates where direct measurement or assessment is impractical.

    #HackerNewsAnalytics #HackerNews #MediaAnalysis #RandomSampling #Statistics

  2. OK, current stats are 63.5% of posts classified, with 29.8% of sites classified, a/k/a the old 65/30 rule. The mean posts per unclassified site is 1.765, so my returns for further classification will be ... small.

    Full breakdown:

    sites  posts/site
        4  20
       14  19
       13  18
       23  17
       32  16
       37  15
       48  14
       55  13
       96  12
      120  11
      122  10
      168  9
      247  8
      315  7
      396  6
      622  5
     1052  4
     2016  3
     5103  2
    26494  1
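The 1.765 mean can be checked directly against this breakdown. The sketch below assumes the first column is the number of unclassified sites and the second is posts per site, a reading that reproduces the quoted mean:

```python
# (number of sites, posts per site), copied from the breakdown above.
unclassified = [
    (4, 20), (14, 19), (13, 18), (23, 17), (32, 16),
    (37, 15), (48, 14), (55, 13), (96, 12), (120, 11),
    (122, 10), (168, 9), (247, 8), (315, 7), (396, 6),
    (622, 5), (1052, 4), (2016, 3), (5103, 2), (26494, 1),
]

total_sites = sum(sites for sites, _ in unclassified)
total_posts = sum(sites * posts for sites, posts in unclassified)
mean_posts = total_posts / total_sites

print(total_sites, total_posts, round(mean_posts, 3))
# 36977 65252 1.765
```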

    A ... large number of sites w/ <= 20 posts are actually classified, mostly by regexp rules & patterns. Oh, hey, I can dump that breakdown as well:

    sites  posts/site
       35  20
       27  19
       47  18
       31  17
       33  16
       41  15
       51  14
       45  13
       42  12
       29  11
       46  10
       46  9
       47  8
       91  7
      138  6
      178  5
      269  4
      524  3
     1624  2
    11472  1

    I could pick up just under 4% more posts by classifying another 564 sites (the unclassified ones with 10 or more posts) but ... that sounds a bit too much like work at the moment. Compromises and trade-offs.
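A rough check of that trade-off, as a sketch. It assumes the 564 sites are the unclassified ones with 10 to 20 posts, and it infers the total post count from the 63.5% classified figure instead of reading it from the data:

```python
# Unclassified sites with 10 or more posts: (number of sites, posts per site).
top_rows = [
    (4, 20), (14, 19), (13, 18), (23, 17), (32, 16), (37, 15),
    (48, 14), (55, 13), (96, 12), (120, 11), (122, 10),
]

extra_sites = sum(sites for sites, _ in top_rows)
extra_posts = sum(sites * posts for sites, posts in top_rows)

# Assumption: 65,252 unclassified posts (summed from the full unclassified
# breakdown) make up the 36.5% of posts that are not yet classified.
unclassified_posts = 65252
estimated_total_posts = unclassified_posts / (1 - 0.635)

gain = extra_posts / estimated_total_posts
print(extra_sites, extra_posts, f"{gain:.2%}")
# 564 7117 3.98%
```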

    Now to try to turn this into an analysis over time.

    I've been working with a summary of activity by site, so running analysis has been pretty quick (52k records and gawk running over that).

    Full date analysis requires reading nearly 180k records, and ... hopefully not having to loop through 52k sites for each of those. Gawk's runtimes start to asplode when running tens of millions of loop iterations, especially if regexes are involved.
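One common way to avoid the records-times-sites loop is to classify each distinct site once and then look it up by key for every record. gawk's associative arrays allow the same approach. The sketch below is in Python; the record layout, rule patterns, and class labels are all hypothetical:

```python
import re
from collections import Counter

def build_classifier(exact, patterns):
    """Return classify(site); regex rules run at most once per distinct site."""
    compiled = [(re.compile(rx), label) for rx, label in patterns]
    cache = dict(exact)  # exact-match sites are answered by a plain lookup

    def classify(site):
        if site not in cache:
            cache[site] = next(
                (label for rx, label in compiled if rx.search(site)),
                "UNCLASSIFIED",
            )
        return cache[site]

    return classify

def tally_by_year(records, classify):
    """Count posts per (year, class) in a single pass over the records."""
    counts = Counter()
    for date, site in records:
        counts[(date[:4], classify(site))] += 1
    return counts

# Hypothetical rules and records, for illustration only.
classify = build_classifier(
    exact={"enterprise.google.com": "corporate"},
    patterns=[(r"\.edu$", "academic"), (r"(^|\.)github\.io$", "blog")],
)
records = [
    ("2023-06-01", "enterprise.google.com"),
    ("2023-06-02", "mit.edu"),
    ("2022-01-05", "mit.edu"),
    ("2022-03-09", "sfg.io"),
]
counts = tally_by_year(records, classify)
print(counts)
```

With this layout the regexes run roughly once per distinct site instead of once per record and site pair, so the work grows with the number of records plus the number of sites, not their product.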

    #HackerNewsAnalytics #HackerNews #gawk #awk #DataAnalysis #MediaAnalysis

  3. Oh, and something that would be really useful would be a quick way of looking up a website and getting a rough classification as to what type of content it presents.

    Wikipedia can offer some of this, as occasionally can sources such as Crunchbase, though the former is hard to parse.

    The Alexa Crawl (Amazon, originally by Brewster Kahle of the Internet Archive) used to offer this as well, though I think that's no longer active.

    If anyone knows of other / better sources, I'd love to know.

    #DearMastomind #DearHivemind #HackerNewsAnalytics

  4. With my HN FP archive updated through yesterday, as one does, here are the refreshed counts of "Reddit" in front-page story titles, by year:

    year  stories
    2007  41
    2008  31
    2009  15
    2010  44
    2011  41
    2012  46
    2013  28
    2014  27
    2015  27
    2016  19
    2017  15
    2018  15
    2019  12
    2020  24
    2021  12
    2022  13
    2023  28

    And what's the occurrence by month in 2023, you ask? Why, I'll tell you:

    month  stories
        1  1
        2  1
        3  0
        4  1
        5  3
        6  22

    And those 22 stories in the first half of June are ... not positive:

    1. Teddit – An alternative Reddit front-end focused on privacy
    2. [dupe] Third-party Reddit apps are being crushed by price increases
    3. Demo: Fully P2P and open source Reddit alternative
    4. Reddit’s plan to kill third-party apps sparks widespread protests
    5. Reddit's Recently Announced API Changes, and the future of /r/blind
    6. Redditor creates working anime QR codes using Stable Diffusion
    7. ArchiveTeam has saved over 11.2B Reddit links
    8. Archive your Reddit data before it's too late
    9. Reddit Strike Has Started
    10. Thousands of subreddits pledge to go dark after the Reddit CEO’s recent remarks
    11. Show HN: Non.io, a Reddit-like platform Ive been working on for the last 4 years
    12. Did Reddit just destroy mobile browser access?
    13. Reddit.com appears to be having an outage
    14. Show HN: Zsync, a Reddit Alternative with the Goal to Reward Quality Comments
    15. Apollo’s Christian Selig explains his fight with Reddit – and why users revolted
    16. The Reddit blackout will continue
    17. The Reddit blackout has left Google barren and full of holes
    18. Reddit’s blackout protest is set to continue indefinitely
    19. Reddit Threatens to Remove Moderators from Subreddits Continuing Blackouts
    20. Reddit is removing moderators that protest by taking their communities private
    21. Louis Rossmann calls community to leave Reddit
    22. Reddit App – Suspicious high number of recent 5 star, one word reviews
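Tallies like the yearly and monthly counts above can be produced with a single pass over (date, title) records. This is a sketch and not the pipeline behind the post: it assumes ISO-style dates and a case-insensitive substring match, so "Redditor" and "subreddits" count too. The dates in the sample are placeholders, not the stories' actual front-page dates:

```python
import re
from collections import Counter

REDDIT = re.compile(r"reddit", re.IGNORECASE)

def count_mentions(stories, key_len=4):
    """Count matching titles; key_len=4 groups by YYYY, key_len=7 by YYYY-MM."""
    return Counter(
        date[:key_len] for date, title in stories if REDDIT.search(title)
    )

# Titles are from the list above, except the last one, which is made up.
stories = [
    ("2023-06-01", "Reddit Strike Has Started"),
    ("2023-06-02", "Redditor creates working anime QR codes using Stable Diffusion"),
    ("2023-05-01", "Archive your Reddit data before it's too late"),
    ("2023-06-03", "A hypothetical title about something else"),
]

by_year = count_mentions(stories, key_len=4)
by_month = count_mentions(stories, key_len=7)
print(by_year)
print(by_month)
```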

    #HackerNews #HackerNewsAnalytics #Reddit #RedditStrike #RedditBlackout

  5. Given the #RedditStrike / #RedditBlackout, a question popped up on Hacker News as to whether stories critical of Reddit were being overwhelmingly flagged.

    So I updated my Front Page archive through 2023-06-13, and looked at the numbers.

    There've been 16 front-page stories since 31 May 2023 when the first story on API pricing broke.

    That compares against total mentions of Reddit since 2007:

    year  stories
    2007  41
    2008  31
    2009  15
    2010  44
    2011  41
    2012  46
    2013  28
    2014  27
    2015  27
    2016  19
    2017  15
    2018  15
    2019  12
    2020  24
    2021  12
    2022  13
    2023  21

    Note that we're only 45% of the way through 2023. Extrapolating linearly from the 21 stories to date (and ignoring that the blow-up of the past two weeks is itself well above trend), 2023 is on track for 46 FP stories, which ties the high-water mark set in 2012.
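The projection is a simple linear extrapolation. The sketch below assumes the elapsed fraction is measured by day of year as of the 2023-06-13 archive date. The function name is hypothetical:

```python
from datetime import date

def projected_total(count_to_date, as_of):
    """Scale a year-to-date count up to a full year by elapsed day-of-year."""
    day_of_year = as_of.timetuple().tm_yday
    days_in_year = date(as_of.year, 12, 31).timetuple().tm_yday
    fraction = day_of_year / days_in_year
    return fraction, count_to_date / fraction

fraction, projection = projected_total(21, date(2023, 6, 13))
print(f"{fraction:.0%} of the year elapsed, projected {projection:.1f} stories")
# 45% of the year elapsed, projected 46.7 stories
```

Truncated to a whole number, that gives the 46 stories quoted above.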

    #HackerNews #HackerNewsAnalytics #MediaAnalysis