#hackernewsanalytics — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #hackernewsanalytics, aggregated by home.social.
-
More on "UNCLASSIFIED": there are 36,520 of those sites right now. (Despite knowing better I keep diving in and classifying more of them.)
It's not practical to list all of them. But we can randomly sample. And large-sample statistics start to apply at about n=30, so let's just grab 30 of those sites at random using sort -R | head -30:
1 sfg.io
1 extroverteddeveloper.com
2 letmego.com
1 thestrad.com
2 bombmagazine.org
1 domlaut.com
1 bootstrap.io
1 jumpdriveair.com
2 desmos.com
1 leo32345.com
1 echopen.org
1 schd.ws
1 web3us.com
7 akkartik.name
1 bcardarella.com
1 cancerletter.com
1 platinumgames.com
1 industrytap.com
2 worldoftea.org
1 motion.ai
1 vectorly.io
2 enterprise.google.com
1 lift-heavy.com
1 davidpeter.me
1 panoye.com
3 thestrategybridge.org
2 fontsquirrel.com
1 kettunen.io
1 moogfoundation.org
2 elekslabs.com
That's a few foundations, a few blogs, a corporate site (enterprise.google.com), and something about tea, all with a small number of posts (1--7).
I'm looking at some slightly larger samples (60--100) here on my own system, and can actually make some comparisons across samples (to see how much variance there is) which can give some more information on tuning what I would expect to find under the "UNCLASSIFIED" sites.
Which is one way of using #StatisticalMethods to make estimates where direct measurement or assessment is impractical.
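As a rough illustration of the sampling idea above, here is a minimal Python sketch that draws several independent samples of 30 sites and compares their mean posts per site. The `demo` data is made up purely so the example runs; the function names and the seed handling are assumptions, not the pipeline actually used for these posts.

```python
import random
from statistics import mean

def sample_sites(site_counts, n=30, seed=None):
    """Draw n (site, posts) pairs uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(sorted(site_counts.items()), n)

def sample_means(site_counts, n=30, draws=5, seed=0):
    """Mean posts per site for several independent samples of size n."""
    rng = random.Random(seed)
    items = sorted(site_counts.items())
    return [mean(posts for _, posts in rng.sample(items, n)) for _ in range(draws)]

def demo_posts(i):
    """Made-up post count: mostly 1, sometimes 2, occasionally 7."""
    return 1 + (1 if i % 10 == 0 else 0) + (5 if i % 100 == 0 else 0)

# Hypothetical stand-in for the real site list: 1,000 invented sites.
demo = {f"site{i}.example": demo_posts(i) for i in range(1000)}

for site, posts in sample_sites(demo, n=30, seed=1)[:5]:
    print(posts, site)

# The spread across these means hints at how much one sample can be trusted.
print(sample_means(demo, n=30, draws=5))
```

If the means from repeated draws sit close together, a single sample of 30 is probably telling you something stable about the whole set.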
#HackerNewsAnalytics #HackerNews #MediaAnalysis #RandomSampling #Statistics
-
OK, current stats are 63.5% of posts classified, with 29.8% of sites classified, a/k/a the old 65/30 rule. The mean posts per unclassified site is 1.765, so my returns for further classification will be ... small.
Full breakdown (number of unclassified sites, then posts per site):
4 20
14 19
13 18
23 17
32 16
37 15
48 14
55 13
96 12
120 11
122 10
168 9
247 8
315 7
396 6
622 5
1052 4
2016 3
5103 2
26494 1
A ... large number of sites w/ <= 20 posts are actually classified, mostly by regexp rules & patterns. Oh, hey, I can dump that breakdown as well:
35 20
27 19
47 18
31 17
33 16
41 15
51 14
45 13
42 12
29 11
46 10
46 9
47 8
91 7
138 6
178 5
269 4
524 3
1624 2
11472 1
I could pick just under 4% more posts by classifying another 564 sites but ... that sounds a bit too much like work at the moment. Compromises and trade-offs.
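The figures quoted above can be re-derived from the unclassified breakdown. The sketch below reads each row as (number of sites, posts per site), which is an interpretation of the table rather than something stated outright. The 180,000 total is an assumption standing in for the "nearly 180k records" figure.

```python
# (sites, posts_per_site) rows copied from the unclassified breakdown.
UNCLASSIFIED = [
    (4, 20), (14, 19), (13, 18), (23, 17), (32, 16), (37, 15), (48, 14),
    (55, 13), (96, 12), (120, 11), (122, 10), (168, 9), (247, 8), (315, 7),
    (396, 6), (622, 5), (1052, 4), (2016, 3), (5103, 2), (26494, 1),
]

# Assumption: round stand-in for the approximate total number of posts.
TOTAL_POSTS_APPROX = 180_000

def totals(hist, min_posts=1):
    """Return (site count, post count) for rows with at least min_posts posts."""
    sites = sum(n for n, posts in hist if posts >= min_posts)
    posts = sum(n * posts for n, posts in hist if posts >= min_posts)
    return sites, posts

sites, posts = totals(UNCLASSIFIED)
print(sites, posts, round(posts / sites, 3))   # 36977 65252 1.765

top_sites, top_posts = totals(UNCLASSIFIED, min_posts=10)
print(top_sites, top_posts)                    # 564 7117
print(round(100 * top_posts / TOTAL_POSTS_APPROX, 2))  # 3.95 (percent, approximate)
```

Under that reading, the 564 sites are exactly the unclassified sites with 10 to 20 posts, and their 7,117 posts come to just under 4% of roughly 180k.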
Now to try to turn this into an analysis over time.
I've been working with a summary of activity by site, so running analysis has been pretty quick (52k records and gawk running over that).
To do full date analysis requires reading nearly 180k records, and ... hopefully not having to loop through 52k sites for each of those. Gawk's runtimes start to asplode when running tens of millions of loop iterations, especially if regexes are involved.
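One way to avoid the nested loop, sketched in Python rather than gawk: resolve each site to a class once, keep the result in a hash table, then make a single pass over the dated records with a constant-time lookup per record. The record layout, the class labels, and the function name are hypothetical; gawk's associative arrays would support the same pattern.

```python
from collections import Counter

def classify_by_year(records, site_class):
    """Count records per (year, class) in one pass.

    records:    iterable of (date 'YYYY-MM-DD', site) pairs (assumed layout)
    site_class: dict mapping site -> class, built once up front
    """
    counts = Counter()
    for date, site in records:
        year = date[:4]
        # One dictionary lookup per record: no loop over all sites,
        # and no regex evaluation inside the hot loop.
        counts[(year, site_class.get(site, "UNCLASSIFIED"))] += 1
    return counts

# Hypothetical classes and records, for illustration only.
site_class = {"desmos.com": "education", "fontsquirrel.com": "design"}
records = [
    ("2015-03-01", "desmos.com"),
    ("2015-07-19", "sfg.io"),
    ("2016-01-02", "desmos.com"),
]

for (year, cls), n in sorted(classify_by_year(records, site_class).items()):
    print(year, cls, n)
```

With this shape the work is roughly 52k classifications plus 180k lookups, instead of tens of millions of loop iterations.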
#HackerNewsAnalytics #HackerNews #gawk #awk #DataAnalysis #MediaAnalysis
-
Oh, and something that would be really useful would be a quick way of looking up a website and getting a rough classification as to what type of content it presents.
Wikipedia can offer some of this, as occasionally can sources such as Crunchbase, though the former is hard to parse.
The Alexa Crawl (Amazon, originally by Brewster Kahle of the Internet Archive) used to offer this as well, though I think that's no longer active.
If anyone knows of other / better sources, I'd love to know.
-
With my HN FP archive updated through yesterday, as one does, updated occurrences of "Reddit" in front-page story titles:
2007 41
2008 31
2009 15
2010 44
2011 41
2012 46
2013 28
2014 27
2015 27
2016 19
2017 15
2018 15
2019 12
2020 24
2021 12
2022 13
2023 28
And what's the occurrence by month in 2023, you ask? Why, I'll tell you:
1 1
2 1
3 0
4 1
5 3
6 22
And those 22 stories in the first half of June are ... not positive:
- Teddit – An alternative Reddit front-end focused on privacy
- [dupe] Third-party Reddit apps are being crushed by price increases
- Demo: Fully P2P and open source Reddit alternative
- Reddit’s plan to kill third-party apps sparks widespread protests
- Reddit's Recently Announced API Changes, and the future of /r/blind
- Redditor creates working anime QR codes using Stable Diffusion
- ArchiveTeam has saved over 11.2B Reddit links
- Archive your Reddit data before it's too late
- Reddit Strike Has Started
- Thousands of subreddits pledge to go dark after the Reddit CEO’s recent remarks
- Show HN: Non.io, a Reddit-like platform Ive been working on for the last 4 years
- Did Reddit just destroy mobile browser access?
- Reddit.com appears to be having an outage
- Show HN: Zsync, a Reddit Alternative with the Goal to Reward Quality Comments
- Apollo’s Christian Selig explains his fight with Reddit – and why users revolted
- The Reddit blackout will continue
- The Reddit blackout has left Google barren and full of holes
- Reddit’s blackout protest is set to continue indefinitely
- Reddit Threatens to Remove Moderators from Subreddits Continuing Blackouts
- Reddit is removing moderators that protest by taking their communities private
- Louis Rossmann calls community to leave Reddit
- Reddit App – Suspicious high number of recent 5 star, one word reviews
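The yearly and monthly tallies above could be produced along these lines. This is only a sketch: it assumes the archive can be read as (date, title) pairs and that a case-insensitive substring match is intended, so "Redditor" and "subreddits" count as mentions. The sample stories are invented placeholders, not rows from the actual archive.

```python
from collections import Counter

def count_mentions(stories, term="reddit"):
    """Tally titles containing term, by year and by (year, month).

    stories: iterable of (date 'YYYY-MM-DD', title) pairs (assumed layout)
    """
    by_year, by_month = Counter(), Counter()
    needle = term.lower()
    for date, title in stories:
        if needle in title.lower():
            year, month = date[:4], int(date[5:7])
            by_year[year] += 1
            by_month[(year, month)] += 1
    return by_year, by_month

# Invented placeholder stories, for illustration only.
stories = [
    ("2023-05-31", "Example story about Reddit API pricing"),
    ("2023-06-02", "A Redditor builds something"),
    ("2023-06-05", "Unrelated story about compilers"),
    ("2022-11-20", "Example subreddit goes private"),
]

by_year, by_month = count_mentions(stories)
for year in sorted(by_year):
    print(year, by_year[year])
```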
#HackerNews #HackerNewsAnalytics #Reddit #RedditStrike #RedditBlackout
-
Given the #RedditStrike / #RedditBlackout, a question popped up on Hacker News as to whether stories critical of Reddit were being overwhelmingly flagged.
So I updated my Front Page archive through 2023-06-13, and looked at the numbers.
There've been 16 front-page stories since 31 May 2023 when the first story on API pricing broke.
That compares against total mentions of Reddit since 2007:
2007 41
2008 31
2009 15
2010 44
2011 41
2012 46
2013 28
2014 27
2015 27
2016 19
2017 15
2018 15
2019 12
2020 24
2021 12
2022 13
2023 21
Note that we're only 45% of the way through 2023, so at the rate of stories-to-date for the year (and ignoring the blow-up in the past two weeks which itself is well-above trend), 2023 is on track for 46 FP stories, which ties the high-water mark set in 2012.
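The projection is a straight-line extrapolation: stories so far divided by the fraction of the year elapsed. A minimal sketch, assuming the as-of date is 2023-06-13 (the date the archive was updated through) and that the count of 21 includes the recent burst:

```python
from datetime import date

def projected_total(count_so_far, as_of):
    """Extrapolate a year-to-date count linearly to a full-year total."""
    year_start = date(as_of.year, 1, 1)
    next_year = date(as_of.year + 1, 1, 1)
    # Count the as-of day itself as elapsed.
    elapsed_days = (as_of - year_start).days + 1
    fraction = elapsed_days / (next_year - year_start).days
    return fraction, count_so_far / fraction

fraction, projected = projected_total(21, date(2023, 6, 13))
print(f"{fraction:.1%} of the year elapsed, projected total {projected:.1f}")
# 44.9% of the year elapsed, projected total 46.7
```

That lands at roughly 46 to 47 stories, in line with the 2012 figure of 46.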