home.social

#hackernewsanalytics — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #hackernewsanalytics, aggregated by home.social.

  1. @PixelJones What I'd really like to do is to get away from the per-item / per-use payment model, and instead think of the infostream as a distribution utility for which access rather than use is the principal consideration, and in which payment ability (wealth & income) rather than content value is the basis on which payments are made.

    The questions of both what content is made available and how that content is compensated I'm leaving somewhat vague, though in general we have systems which work for this, and which have worked for nearly a century now based on broadcast & cable media, audit-based measurement (Nielsen, Arbitron, etc.), distributor-based negotiations (with individual broadcast stations or networks), and something closely approaching a common-carrier model for the actual access providers (that is, ISPs).

    The points @dangillmor raised are valid: a gatekeeper monopoly is a critical hazard, and is worth addressing from a competitiveness standpoint, independent of this proposal.

    Why "all you can eat"? Two principal reasons:

    1: Need for information is strongly independent of capacity to pay, and often inversely associated.

    2: There are entirely novel capabilities afforded by access at scale which a usage-based payment model largely forecloses on. Aaron Swartz's wholesale downloading of JSTOR scientific papers, which led to his prosecution and suicide, is a key case in point. It's possible to look through, over, and among a corpus to find relationships not otherwise manifest. (I'm doing something along these lines with my #HackerNewsAnalytics series posted here on the Fediverse.)

    The notion of an individual or household account, associated with personal mobile devices and/or household Internet service, from which pro-rata payments are then allocated amongst various providers is one option for compensation, though even that might well not be ideal. That imposes a huge surveillance component itself (who is reading, listening to, or watching what), and could well disproportionately benefit less substantial works and starve more substantial ones. Most critically, it could starve works which are far more expensive to produce at quality, such as investigative journalism or scientific research.

    Some sense of local, regional, national, and global providers / publishers, within genres, funded with a specific budget and for a minimum guaranteed time period, would provide the institutional stability to provide certain classes of work: news, education, business and government publications, academic research, and of course, entertainment.

    And, again, multiple revenue streams, including premium subscriptions, patronage, advertising, etc., could well be additional components. But an access-based automatic and universally-billed tier really does seem to be a possibility that's rarely mentioned or advocated.

    @cobalt

    #UniversalContentSyndication #PayingForJournalism

  2. On general discussion forums and "paying for media"

    One frequent dispute online is over paywalled links, and the general advisability on various grounds of sharing workarounds. I happen to have data for Hacker News (HN), so that's what I discuss here.

    As I'm sitting on a trove of ~190k front page stories and the sites linked by them, I can bring some insight to this debate. As of 21 June 2023, there were 52,642 distinct sites which have made the front page alone (30 items/day). Front-page stories are roughly 3% of all submitted posts, so the tally of sites across all submissions would be rather larger.

    How many of those 52,642 sites should HN members subscribe to?

    If we restrict that to only the sites with 100+ front-page submissions, that number falls to 149. Still, arguably, excessive.

    Of the sites I've identified as "general news" (all sites w/ >= 17 appearances, plus a few others), that list is 146.

    Those constitute 8.47% of all HN front-page posts, the second-largest overall category following blogs.

    I would suggest that expecting the 600k+ active HN participants, let alone the 5 million or so total monthly users, to individually subscribe to more than a very small handful of such sites is entirely unrealistic.

    Subscriptions are a concept which worked reasonably well for local newspapers serving limited areas, in which some fraction of households might subscribe to one daily, and far fewer to several. The majority of expenses were covered by advertising, however.

    Whatever business model people are going to suggest for online media, it's going to have to address the fact that individual people cannot and will not register many thousands, or even dozens, of subscriptions.

    (Adapted from an earlier HN comment: news.ycombinator.com/item?id=3)

    Edits: Rephrasing.

    #HackerNews #HackerNewsAnalytics #Paywalls #Subscriptions #Journalism

  3. HackerNews changed how it dealt with highly-active discussions around January 2009, based on evidence I see (far fewer spicy threads after that date).

    I'm also seeing that spicy stories actually tend to rank slightly higher on the page (a lower "storypos", that is, story position, value), which is counter to my expectation. This may of course be due to selection bias --- moderators specifically lift the limit on overheated stories, so that those stories that do survive are more appropriate to HN.

    I'd like to look at semantic / sentiment elements here as well, words or phrases which seem more prevalent on high-ratio stories. Here my analytic methods work against me as the HN title of a post is often quite short and not especially descriptive, though with some examples (as with the mental health study mentioned earlier).

    #HackerNews #HackerNewsAnalytics

  4. Hacker News "Ratio": political commentary sites

    Continuing my look at the comments/votes ratio, a look at sites which tend to focus on political commentary and their "spiciness". These tend to be well above mean (0.63), median (0.52), and tend to be a standard deviation or more from the mean (1 sd: 0.78, 2 sd: 0.92, 3 sd: 1.06).

    Stories Votes Comments Ratio Site
    2 18 57 3.167 heritage.org
    4 143 224 1.566 hoover.org
    9 473 603 1.275 breitbart.com
    8 1724 1873 1.086 cityobservatory.org
    9 364 379 1.041 mises.org
    1 56 55 0.982 adamsmith.org
    7 2488 2372 0.953 city-journal.org
    1 92 85 0.924 manhattan-institute.org
    70 13143 11614 0.884 reason.com
    5 854 722 0.845 jacobinmag.com
    1 204 153 0.750 theblaze.com
    13 1607 1202 0.748 bostonreview.net
    5 1682 1252 0.744 tribunemag.co.uk
    4 629 465 0.739 nationaljournal.com
    5 1907 1400 0.734 americanaffairsjournal.or
    12 2164 1584 0.732 alternet.org
    10 1302 871 0.669 cato.org
    5 738 493 0.668 dailycaller.com
    9 1387 844 0.609 dailykos.com
    5 759 450 0.593 rawstory.com
    10 2538 1455 0.573 rootsofprogress.org
    2 552 275 0.498 theroot.com
    30 7881 3850 0.489 rt.com
    2 1256 467 0.372 wsws.org

    Note that general news tends somewhat toward spicy, though not as much as the explicitly political sites. Of the 147 sites I'd identified as "general news", ratio statistics are:

    n: 147, sum: 94.415, min: 0.092, max: comms,, mean: 0.642279, median: 0.605, sd: 0.433165

    %-ile:

    5: 0.234, 10: 0.341, 15: 0.4515,
    20: 0.491, 25: 0.51, 30: 0.5305,
    35: 0.5415, 40: 0.566, 45: 0.581,
    55: 0.614, 60: 0.6285, 65: 0.654,
    70: 0.68, 75: 0.716, 80: 0.734,
    85: 0.7875, 90: 0.8715, 95: 1.1925

    (As with other toots in this series, Markdown formatting is used; toot.cat may present it better than your own instance does.)

    #HackerNews #HackerNewsAnalytics

  5. The 20 "spiciest" sites seem to be (using a cut-off of 20+ stories):

    Site Stories Votes Comments Ratio
    apnews.com 36 14674 17512 1.193
    sfchronicle.com 25 5771 6174 1.070
    variety.com 24 5479 4992 0.911
    mattmaroon.com 73 3332 3023 0.907
    axios.com 92 38075 34150 0.897
    bizjournals.com 20 2183 1959 0.897
    cnbc.com 174 59983 53056 0.885
    apple.com 241 99945 88396 0.884
    reason.com 70 13143 11614 0.884
    nypost.com 28 5851 5088 0.870
    markevanstech.com 22 290 251 0.866
    macrumors.com 62 18700 16162 0.864
    nikkei.com 56 17568 15174 0.864
    economist.com 829 119205 102702 0.862
    thewalrus.ca 30 6194 5199 0.839
    techradar.com 30 7227 6053 0.838
    backreaction.blogspot.com 33 7209 5968 0.828
    strongtowns.org 27 8279 6857 0.828
    mondaynote.com 45 7581 6268 0.827
    coindesk.com 22 10236 8355 0.816

    And the 20 least spicy sites are:

    Site Stories Votes Comments Ratio
    particletree.com 37 997 227 0.228
    brendangregg.com 40 11135 2512 0.226
    intruders.tv 28 324 73 0.225
    aphyr.com 34 8514 1910 0.224
    andrewchen.typepad.com 51 757 168 0.222
    michaelnielsen.org 31 3335 723 0.217
    igvita.com 38 3626 767 0.212
    startuplessonslearned.blo 24 1101 232 0.211
    citusdata.com 51 8361 1717 0.205
    ferd.ca 21 5883 1132 0.192
    ocks.org 27 6036 1120 0.186
    tensorflow.org 22 5612 1020 0.182
    aosabook.org 21 3899 669 0.172
    ocw.mit.edu 41 8793 1500 0.171
    david.weebly.com 20 1364 226 0.166
    jslogan.com 24 97 16 0.165
    burningdoor.com 23 149 23 0.154
    linusakesson.net 26 4531 684 0.151
    github.com/0xax 22 2168 121 0.056

    #HackerNews #HackerNewsAnalytics

  6. The Hacker News Ratio

    One concept Hacker News uses to moderate discussions is a "flamewar detector", which based on moderator comments over the years is triggered when a discussion has > 40 comments AND there are more comments than votes on the article.

    That had long struck me as questionable, but it's now something I can look at and ... it seems reasonably accurate. I've calculated ratios of all 178,882 HN Front Page stories (as of 2023-6-31), and ... do I have some ratios.
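    A minimal sketch of the ratio and the flamewar heuristic as described above, in Python rather than the gawk used for this series. The `Story` record, the function names, and the three sample stories are made up for illustration; HN's actual detector is not public, so this only encodes the "> 40 comments AND more comments than votes" rule as stated.

```python
from dataclasses import dataclass


@dataclass
class Story:
    title: str
    votes: int
    comments: int


def ratio(story):
    """Comments-to-votes ratio; 0.0 for a story with no votes."""
    return story.comments / story.votes if story.votes else 0.0


def looks_like_flamewar(story, min_comments=40):
    """Rule as described in the post: > 40 comments AND comments > votes."""
    return story.comments > min_comments and story.comments > story.votes


# Made-up examples, not rows from the archive.
stories = [
    Story("quiet technical post", votes=300, comments=45),
    Story("heated thread", votes=50, comments=120),
    Story("small but lopsided", votes=5, comments=12),
]

flagged = [s.title for s in stories if looks_like_flamewar(s)]
print(flagged)
```

    The third story has a high ratio but too few comments to trip the rule, which is the point of the 40-comment floor.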

    Basic stats:
    n: 178882, sum: 89796.9, min: 0.00, max: 21.00, mean: 0.501990, median: 0.4, sd: 0.432899

    Percentiles:
    %-ile: 5: 0.08, 10: 0.13, 15: 0.17, 20: 0.21, 25: 0.24, 30: 0.27, 35: 0.3, 40: 0.33, 45: 0.37, 55: 0.44, 60: 0.48, 65: 0.53, 70: 0.58, 75: 0.64, 80: 0.72, 85: 0.82, 90: 0.96, 95: 1.22
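    Summary figures of this shape can be produced with Python's standard `statistics` module. This is a sketch only: the sample values are invented, and the choice of population standard deviation and the "inclusive" percentile method are assumptions, since the post does not say which variants were used.

```python
import statistics


def summarize(ratios, percentiles=(5, 25, 50, 75, 95)):
    """Return count, sum, extremes, mean, median, sd and chosen percentiles."""
    # quantiles(n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    cuts = statistics.quantiles(ratios, n=100, method="inclusive")
    return {
        "n": len(ratios),
        "sum": sum(ratios),
        "min": min(ratios),
        "max": max(ratios),
        "mean": statistics.fmean(ratios),
        "median": statistics.median(ratios),
        "sd": statistics.pstdev(ratios),  # population sd: an assumption
        "percentiles": {p: cuts[p - 1] for p in percentiles},
    }


sample_ratios = [0.1, 0.2, 0.3, 0.4, 0.5]  # made-up values, not archive data
stats = summarize(sample_ratios)
print(stats)
```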

    Because of how I've parsed and processed data, it's not entirely straightforward to pull up the specific posts, though I can find those by the date and story position (ranked 1--30 on the page).

    And ... yeah, the stories that tend to rate high based on this metric do tend to be sort of flamey.

    The most ratioed post of all time was "juwo beta is released (at last!) Please use it and help improve it!", from 18 April 2007, at 21.0:

    news.ycombinator.com/item?id=1

    Sometime around 2009--2010 the flamewar detector seems to have been implemented and ratios tend to be much lower, though there are still some pretty spicy discussions. One from the National Institutes of Health titled "Mental illness, mass shootings, and the politics of American firearms", posted on 26 May 2022 (for a story originally dating from 2015) is the highest-ratioed post after the flamewar detector came into use, at 5.99:

    news.ycombinator.com/item?id=3

    I find it interesting how being able to query my archive affords insights on HN which aren't available through the standard search tools. It's possible to look for specific keywords, or submissions or comments from a specific account, but searching for contentious posts isn't really A Thing.

    I'm doing some further digging to see what patterns might emerge by site, though finding a good minimum number of front-page appearances is one question I'm looking at.

    #HackerNews #HackerNewsAnalytics

  7. More on "UNCLASSIFIED": there are 36,520 of those sites right now. (Despite knowing better I keep diving in and classifying more of them.)

    It's not practical to list all of them. But we can randomly sample. And large-sample statistics start to apply at about n=30, so let's just grab 30 of those sites at random using sort -R | head -30:

    1 sfg.io
    1 extroverteddeveloper.com
    2 letmego.com
    1 thestrad.com
    2 bombmagazine.org
    1 domlaut.com
    1 bootstrap.io
    1 jumpdriveair.com
    2 desmos.com
    1 leo32345.com
    1 echopen.org
    1 schd.ws
    1 web3us.com
    7 akkartik.name
    1 bcardarella.com
    1 cancerletter.com
    1 platinumgames.com
    1 industrytap.com
    2 worldoftea.org
    1 motion.ai
    1 vectorly.io
    2 enterprise.google.com
    1 lift-heavy.com
    1 davidpeter.me
    1 panoye.com
    3 thestrategybridge.org
    2 fontsquirrel.com
    1 kettunen.io
    1 moogfoundation.org
    2 elekslabs.com

    That's a few foundations, a few blogs, a corporate site (enterprise.google.com), and something about tea, all with a small number of posts (1--7).

    I'm looking at some slightly larger samples (60--100) here on my own system, and can actually make some comparisons across samples (to see how much variance there is) which can give some more information on tuning what I would expect to find under the "UNCLASSIFIED" sites.

    Which is one way of using #StatisticalMethods to make estimates where direct measurement or assessment is impractical.
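    The sampling idea can be sketched in Python as follows (a stand-in for `sort -R | head -30`, not the commands actually run). The population of placeholder domains, the `labels` mapping, and both function names are hypothetical; in practice the labels would come from hand-classifying each sampled site.

```python
import random


def sample_sites(sites, k=30, seed=None):
    """Draw k distinct sites uniformly at random."""
    return random.Random(seed).sample(sites, k)


def label_shares(sample, labels):
    """Fraction of the sample carrying each hand-assigned label."""
    counts = {}
    for site in sample:
        label = labels[site]
        counts[label] = counts.get(label, 0) + 1
    return {label: count / len(sample) for label, count in counts.items()}


# Hypothetical population: 1,000 placeholder domains, a quarter labelled "blog".
sites = [f"site{i}.example" for i in range(1000)]
labels = {site: ("blog" if i % 4 == 0 else "other") for i, site in enumerate(sites)}

# Repeat the draw to see how much the estimate moves between samples.
for seed in (1, 2, 3):
    sample = sample_sites(sites, k=30, seed=seed)
    print(seed, label_shares(sample, labels).get("blog", 0.0))
```

    Comparing the printed shares across seeds gives a rough feel for sample-to-sample variance before committing to a larger draw.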

    #HackerNewsAnalytics #HackerNews #MediaAnalysis #RandomSampling #Statistics

  8. So ... I'm starting to get the reporting by site classification across years down and ... it is interesting.

    Preliminary and buggy code yet. Also this is highly dependent on how I've actually classified sites.

    I've got a few classifications I'd wanted to keep an eye on:

    • Programming-specific sites. A lot of this is github and gitlab, basically, software projects with code. I'm distinguishing software (which is mostly about use) and programming which involves, or at least anticipates, actual development.

    • "Political commentary". I used this as a description for ... highly political sites (spot-checking to see what stories actually hit the front page, though I should be more robust in that). The list: reason.com, rt.com, bostonreview.net, alternet.org, cato.org, rootsofprogress.org, breitbart.com, dailykos.com, mises.org, dailycaller.com, jacobinmag.com, rawstory.com, tribunemag.co.uk, hoover.org, heritage.org, theroot.com, wsws.org, adamsmith.org, manhattan-institute.org, theblaze.com.

    And there's "academic / science" which is mostly university and academic press / journal sites.

    Anywho....

    ... at least from initial takes, the trends on these do not suggest a shift toward sensationalistic topics and/or sites, but the opposite. Many more programming FP stories in recent years, fewer political commentary, and more academic/science items.

    Presuming this holds up as I code further.

    This is one of the fun things about data analysis: stuff jumps out at you, sometimes confirming hunches, but often radically violating preconceptions.

    I want to look more closely at what happens in the lead-up and follow-on to the 2016 US elections cycle in particular....

    Hrm. What does spike is cryptocurrency-specific sites in 2014. Though that falls off again. (I suspect as that discussion enters more mainstream sources.)

    And "general info" and "general interest" sites seem to rise in recent years.

    #HackerNewsAnalytics #HackerNews #MediaAnalysis

  9. OK, current stats are 63.5% of posts classified, with 29.8% of sites classified, a/k/a the old 65/30 rule. The mean posts per unclassified site is 1.765, so my returns for further classification will be ... small.

    Full breakdown:

    4 20
    14 19
    13 18
    23 17
    32 16
    37 15
    48 14
    55 13
    96 12
    120 11
    122 10
    168 9
    247 8
    315 7
    396 6
    622 5
    1052 4
    2016 3
    5103 2
    26494 1

    A ... large number of sites w/ <= 20 posts are actually classified, mostly by regexp rules & patterns. Oh, hey, I can dump that breakdown as well:

    35 20
    27 19
    47 18
    31 17
    33 16
    41 15
    51 14
    45 13
    42 12
    29 11
    46 10
    46 9
    47 8
    91 7
    138 6
    178 5
    269 4
    524 3
    1624 2
    11472 1

    I could pick up just under 4% more posts by classifying another 564 sites but ... that sounds a bit too much like work at the moment. Compromises and trade-offs.

    Now to try to turn this into an analysis over time.

    I've been working with a summary of activity by site, so running analysis has been pretty quick (52k records and gawk running over that).

    To do full date analysis requires reading nearly 180k records, and ... hopefully not having to loop through 52k sites for each of those. Gawk's runtimes start to asplode when running tens of millions of loop iterations, especially if regexes are involved.
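    One way to avoid looping over every site for every story is a keyed lookup: load the site classifications into a hash table once, then make a single pass over the story records. A sketch in Python (not the gawk in use here), with a made-up record layout of `(ISO date, site)` pairs and a tiny illustrative `site_class` table:

```python
from collections import Counter


def count_by_year_and_class(stories, site_class):
    """Single pass over stories; each site lookup is one dict access."""
    counts = Counter()
    for date, site in stories:
        year = date[:4]  # assumes ISO dates such as "2016-11-09"
        counts[(year, site_class.get(site, "UNCLASSIFIED"))] += 1
    return counts


# Illustrative data only; dates and the unknown site are invented.
site_class = {"reason.com": "political commentary", "github.com": "programming"}
stories = [
    ("2016-11-09", "reason.com"),
    ("2016-11-09", "github.com"),
    ("2022-05-26", "github.com"),
    ("2022-05-26", "unknown.example"),
]

counts = count_by_year_and_class(stories, site_class)
for key in sorted(counts):
    print(key, counts[key])
```

    The same shape works in awk with an associative array keyed by site, so the work grows with the number of stories rather than stories times sites.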

    #HackerNewsAnalytics #HackerNews #gawk #awk #DataAnalysis #MediaAnalysis

  10. Oh, and something that would be really useful would be a quick way of looking up a website and getting a rough classification as to what type of content it presents.

    Wikipedia can offer some of this, as occasionally can sources such as Crunchbase, though the former is hard to parse.

    The Alexa Crawl (Amazon, originally by Brewster Kahle of the Internet Archive) used to offer this as well, though I think that's no longer active.

    If anyone knows of other / better sources, I'd love to know.

    #DearMastomind #DearHivemind #HackerNewsAnalytics

  11. I've got this to about 60% of posts classified (by submitted site). I can continue winnowing this down, though there are obviously diminishing returns.

    I've also revised my analysis code so that anything that's not classified defaults to "UNCLASSIFIED", without having to explicitly code that in the sites file.

    I'm thinking of how I might crossref / correlate the site-based findings with title-based analysis. I'm also thinking of looking at average comments / votes by classification, as well as looking at the ratio of comments to votes (HN uses this as a very rough "flamewar" heuristic, though on somewhat shaky grounds IMO).

    My sense is that many of the less-frequently-posted sites will turn out to be blogs of some form. I'm thinking of how I might assess this without having to key all of them.

    <stage_whisper> random sampling <\stage_whisper>

    One issue for less-frequently-occurring sites is that it's easier to code those which match a pattern (twitter, blogspot, livejournal, medium, substack, etc.) than those which are idiosyncratic. Note that a lot of Medium blogs don't appear on Medium domains, as well.

    #HackerNewsAnalytics #HackerNews #MediaAnalysis

  12. I'm continuing to play with this, and have classified a whole mess more sites (reminder to self: update that count) (response to self: 13,150 sites classified).

    So that's about 25% of all sites that are classified. Looking by story count ... it's about 55% of all FP stories. (Power laws are your friend here...)

    Looking at my current breakdowns (and again, this is all VERY ROUGH):

    1 15770 8.82% blog
    2 15034 8.40% general news
    3 13899 7.77% software
    4 12889 7.21% tech news
    5 7960 4.45% academic / science
    6 7294 4.08% n/a
    7 6025 3.37% corporate comm.
    8 4859 2.72% business news
    9 2120 1.19% social media
    10 2031 1.14% general interest
    11 1557 0.87% general magazine
    12 1397 0.78% general information
    13 1239 0.69% technology
    14 1099 0.61% videos
    15 975 0.55% government
    16 607 0.34% ???
    17 559 0.31% tech discussion
    18 505 0.28% tech law
    19 497 0.28% misc documents
    20 420 0.23% science news
    21 316 0.18% mailing list
    22 251 0.14% tech publications
    23 171 0.10% tech blog
    24 149 0.08% literature
    25 136 0.08% business education
    26 133 0.07% cryptocurrency
    27 126 0.07% law
    28 118 0.07% webcomic
    29 109 0.06% entertainment news
    30 103 0.06% health news
    31 103 0.06% video
    32 96 0.05% general discussion
    33 80 0.04% misc
    34 71 0.04% technology / security
    35 49 0.03% translation
    36 47 0.03% images
    37 46 0.03% podcast
    38 42 0.02% journalism
    39 30 0.02% propaganda
    40 29 0.02% healthcare / medicine
    41 18 0.01% medicine
    42 7 0.00% legal news

    Classified: 98966
    Unclassified: 79916
    Total: 178882
    Ratio: 0.553

    My classifications are rough and I may revisit these. "blog" covers a lot of sins, though most are tech blogs (which makes "technology blog" redundant).

    What I'd really like to do is to look at how trends vary over the years. Perhaps also by day of week / month of year. Finally answer that age-old question of whether HN is turning into Reddit....

    As noted above, this is based on classifying the site rather than interpreting the title or reading the source article, so it's all a bit wobbly.

    (This post formats better on toot.cat or on sites that render Markdown.)

    #HackerNewsAnalytics #HackerNews #MediaAnalysis

  13. gagejustins's HN analysis has inspired me to take a crack at classifying Hacker News front page stories by type.

    Whilst he'd manually assessed each front-page story, I'm classifying by site, so that an NY Times article on, say, quantum computing would still be described as "general news".

    I've classified 10,200 of 52,642 domains, the first 300 or so manually, much of the rest using regexes and imputation (e.g., ".edu", ".gov", and sites on Blogspot, Substack, Medium, etc.).

    Results by story count:

    1 13782 general news
    2 13398 software
    3 10473 tech news
    4 8677 blog
    5 7651 academic / science
    6 7294 n/a
    7 4750 ???
    8 4600 business news
    9 3546 corporate comm.
    10 1504 general magazine
    11 1291 general information
    12 1162 general interest
    13 1132 technology
    14 1099 videos
    15 1073 social media
    16 975 government
    17 568 corporate comm
    18 559 tech discussion
    19 505 tech law
    20 251 tech publications
    21 171 tech blog
    22 170 science news
    23 136 business education
    24 104 corporate comm.
    25 103 video
    26 99 corporate commm.
    27 96 general discussion
    28 80 misc
    29 71 technology / security
    30 61 law
    31 59 webcomic
    32 49 translation
    33 48 health news
    34 47 images
    35 46 podcast
    36 32 law
    37 7 legal news

    Unclassified: 93213

    "n/a" indicates no site, e.g., an Ask, Tell, or Show HN post.

    '???' indicates I couldn't (quickly) assess a domain. Examples: 37signals.com, readwriteweb.com, thenextweb.com, archive.org, anandtech.com, avc.com, docs.google.com, righto.com, slideshare.net, infoq.com, hackaday.com, gamasutra.com, marco.org, smashingmagazine.com, highscalability.com, catonmat.net, centernetworks.com, jvns.ca, scribd.com, about.gitlab.com, cloud.google.com, alleyinsider.com, msn.com, firstround.com, axios.com, openculture.com, onstartups.com, ejohn.org, dadgum.com, shkspr.mobi, mixergy.com, geek.com, gmane.org, foundread.com.

    "cproorate commm." is an obvious typo. This is very rough code & classification.
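    The manual-list-plus-regex approach described in this post can be sketched as below, in Python rather than gawk. The specific patterns, their mapping to labels, and the names `MANUAL`, `RULES`, and `classify` are assumptions for illustration, not the rules actually used.

```python
import re

# Hand-classified sites take priority over pattern rules.
MANUAL = {"nytimes.com": "general news"}

# (pattern, label) pairs tried in order; these particular rules are assumptions.
RULES = [
    (re.compile(r"\.edu$"), "academic / science"),
    (re.compile(r"\.gov$"), "government"),
    (re.compile(r"(^|\.)(blogspot\.com|substack\.com|medium\.com)$"), "blog"),
]


def classify(domain):
    """Return a label for a domain, falling back to "UNCLASSIFIED"."""
    domain = domain.lower()
    if domain in MANUAL:
        return MANUAL[domain]
    for pattern, label in RULES:
        if pattern.search(domain):
            return label
    return "UNCLASSIFIED"


for d in ("nytimes.com", "cs.stanford.edu", "nasa.gov",
          "foo.blogspot.com", "education.com"):
    print(d, "->", classify(d))
```

    Anchoring the patterns at the end of the domain keeps a name such as education.com from being swept up by the ".edu" rule.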

    #HackerNewsAnalytics #MediaAnalysis #HackerNews

  14. I have Found My People: "What gets to the front page of Hacker News? A data project"

    Some marketing dude is also looking at the HN front page. We're comparing notes ...

    randomshit.dev/posts/what-gets

    news.ycombinator.com/item?id=3

    #HackerNewsAnalytics

  15. With my HN FP archive updated through yesterday, as one does, updated occurrences of "Reddit" in front-page story titles:

    2007 41
    2008 31
    2009 15
    2010 44
    2011 41
    2012 46
    2013 28
    2014 27
    2015 27
    2016 19
    2017 15
    2018 15
    2019 12
    2020 24
    2021 12
    2022 13
    2023 28

    And what's the occurrence by month in 2023, you ask? Why, I'll tell you:

    1 1
    2 1
    3 0
    4 1
    5 3
    6 22

    And those 22 stories in the first half of June are ... not positive:

    1. Teddit – An alternative Reddit front-end focused on privacy
    2. [dupe] Third-party Reddit apps are being crushed by price increases
    3. Demo: Fully P2P and open source Reddit alternative
    4. Reddit’s plan to kill third-party apps sparks widespread protests
    5. Reddit's Recently Announced API Changes, and the future of /r/blind
    6. Redditor creates working anime QR codes using Stable Diffusion
    7. ArchiveTeam has saved over 11.2B Reddit links
    8. Archive your Reddit data before it's too late
    9. Reddit Strike Has Started
    10. Thousands of subreddits pledge to go dark after the Reddit CEO’s recent remarks
    11. Show HN: Non.io, a Reddit-like platform Ive been working on for the last 4 years
    12. Did Reddit just destroy mobile browser access?
    13. Reddit.com appears to be having an outage
    14. Show HN: Zsync, a Reddit Alternative with the Goal to Reward Quality Comments
    15. Apollo’s Christian Selig explains his fight with Reddit – and why users revolted
    16. The Reddit blackout will continue
    17. The Reddit blackout has left Google barren and full of holes
    18. Reddit’s blackout protest is set to continue indefinitely
    19. Reddit Threatens to Remove Moderators from Subreddits Continuing Blackouts
    20. Reddit is removing moderators that protest by taking their communities private
    21. Louis Rossmann calls community to leave Reddit
    22. Reddit App – Suspicious high number of recent 5 star, one word reviews

    #HackerNews #HackerNewsAnalytics #Reddit #RedditStrike #RedditBlackout

  16. Given the #RedditStrike / #RedditBlackout, a question popped up on Hacker News as to whether or not stories critical of Reddit were being overwhelmingly flagged.

    So I updated my Front Page archive through 2023-06-13, and looked at the numbers.

    There've been 16 front-page stories since 31 May 2023 when the first story on API pricing broke.

    That compares against total mentions of Reddit since 2007:

    2007 41
    2008 31
    2009 15
    2010 44
    2011 41
    2012 46
    2013 28
    2014 27
    2015 27
    2016 19
    2017 15
    2018 15
    2019 12
    2020 24
    2021 12
    2022 13
    2023 21

    Note that we're only 45% of the way through 2023, so at the rate of stories-to-date for the year (and ignoring the blow-up in the past two weeks which itself is well above trend), 2023 is on track for 46 FP stories, which ties the high-water mark set in 2012.
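    The projection is a plain linear extrapolation: divide the year-to-date count by the fraction of the year elapsed. A sketch in Python, assuming the archive cut-off of 2023-06-13 mentioned above and counting that day as elapsed; the function name is made up.

```python
from datetime import date


def projected_annual_count(count_to_date, as_of):
    """Scale a year-to-date count by the fraction of the year elapsed."""
    start = date(as_of.year, 1, 1)
    end = date(as_of.year + 1, 1, 1)
    fraction = ((as_of - start).days + 1) / (end - start).days
    return count_to_date / fraction, fraction


projection, fraction = projected_annual_count(21, date(2023, 6, 13))
print(f"{fraction:.0%} of the year elapsed, projecting {int(projection)} stories")
```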

    #HackerNews #HackerNewsAnalytics #MediaAnalysis

  17. So ... I'm playing with a report showing how often F500 companies are mentioned in HN submission titles.

    As I've noted, most of my scripting is in awk (gawk), and it's ... usually pretty good.

    I'm toying with a couple of loops where I read all 178k titles, and all 500 company names, into arrays, then check to see if the one appears in the other.

    The first iteration of that was based on the index() function, which is a simple string match. Problem is that there are substring matches, for example "Lear" (the company) will match on "Learn", "Learning", etc., and so is strongly overrepresented.

    So I swapped in match(), which is a regular-expression match, and added \W as word-boundaries.

    The index-based search ran in about 20 seconds. That's a brief wait, but doable.

    The match (regex) based search ... just finished as I'm writing this. 13 minutes 40 seconds.

    Regexes are useful, but can be awfully slow.

    Which means that my first go at this --- still using gawk but having it generate grep searches and printing the match count only ... is much faster whilst being accurate. That runs in just under a minute here. I'd looked for another solution as awk is "dumb" re the actual output: it doesn't read or capture the actual counts, so I'll either have to tweak that program or feed its output to an additional parser. Neither of which is a big deal, mind.
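    The substring-versus-whole-word problem can be shown in a few lines of Python (a swapped-in illustration, not the gawk or grep actually used, and no timing claims are implied). The titles are invented; the company names are ones mentioned in this post. Compiling all names into one alternation means each title is scanned once rather than once per company.

```python
import re
from collections import Counter


def count_substring(titles, name):
    """Naive count: any title containing the name as a substring."""
    return sum(1 for title in titles if name in title)


def build_matcher(names):
    """One regex for all names, matching only at word boundaries."""
    # Longest names first so a longer name wins over a shorter prefix of it.
    ordered = sorted(names, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    return re.compile(rf"(?<!\w)({alternation})(?!\w)")


def count_whole_word(titles, names):
    """Count titles mentioning each name; a title counts once per name."""
    matcher = build_matcher(names)
    counts = Counter()
    for title in titles:
        counts.update({m.group(1) for m in matcher.finditer(title)})
    return counts


# Made-up titles for illustration.
titles = [
    "Learn Rust in a weekend",
    "Lear Corporation reports earnings",
    "Machine Learning at Apple",
    "AT&T outage",
]
names = ["Lear", "Apple", "AT&T"]

print(count_substring(titles, "Lear"))   # substring match overcounts
print(count_whole_word(titles, names))
```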

    Oh, and Apple seems to be the most-mentioned company, though the F500 list omits Google (or YouTube, or Android), listing only Alphabet, which probably results in a severe undercount.

    Top 10 using the F100 list:

    1 Apple: 2447
    2 Microsoft: 1517
    3 Amazon: 1457
    4 Intel: 554
    5 Tesla: 404
    6 Netflix: 322
    7 IBM: 309
    8 Adobe: 180
    9 Oracle: 167
    10 AT&T: 143

    Add to those:

    $ egrep -wc '(Google|Alphabet|You[Tt]ube|Android)' hn-titles
    7163
    $ egrep -wc '(Apple|iPhone|iPad|iPod|Mac[Bb]ook)' hn-titles
    3656
    $ egrep -wc '(Facebook|Instagram)' hn-titles
    2512

    Note I didn't even try "Meta", though let's take a quick look ... yeah, that's a mess.

    Up until 2021-10-28, "Meta" is a concept, with 33 entries. That was the day Facebook announced its name change. 82 total matches (so low overall compared to the earlier numbers above), 49 post-announcement, of which two are not related to Facebook a/k/a Meta. Several of the titles mention both FB & Meta ... looks like that's four of 'em.

    So "Meta" boosts FB's count by 45.

    There are another 296 mentions of Steve Jobs and Tim Cook which don't also include "Apple".

    And "Alphabet" has 54 matches, six of which don't relate to the company.

    Of the MFAANG companies:

    Google: 5796
    Apple: 2447
    Facebook: 2371
    Microsoft: 1517
    Amazon: 1457
    Netflix: 322

    (Based on grep.)

    #DataAnalysis #awk #grep #bash #HackerNewsAnalytics

  18. In fact-checking my own comment, I found that my success rate in reaching the HN front page is not the roughly 10% I'd thought.

    It's pretty much spang on 3%, which is the overall site average.

    That's based on my archive's count of my own FP submissions (60) and Algolia search's results for all my article submissions, whether or not they hit the front page (1,974).

    So I guess I'm just about average.

    This gives me the idea of checking against the HN Leaders list to see if anyone's markedly above 3% for FP placements.

    #HackerNewsAnalytics #HackerNews

  19. I was able to draw on my HN FP archive to respond in part to concerns over topic suppression by an HN member:

    news.ycombinator.com/item?id=3

    This is an interesting superpower ...

    Not an awesome superpower, mind, but an interesting one.

    #HackerNews #HackerNewsAnalytics

  20. Hacker News characteristics --- banned sites (2009)

    I've been crawling through some of the early discussions about HN's design, intent, and characteristics.

    One interesting item is a list of 2,096 banned sites from 2009:

    news.ycombinator.com/item?id=4

    There's also Paul Graham's "What I've Learned from Hacker News" (2009):
    paulgraham.com/hackernews.html

    Edit: *Markdown*

    #HackerNews #HackerNewsAnalytics

  21. Hacker News Front Page: Story votes and comments by page position

    I'd been wanting to show this somehow, and it's a bit of a fat table, but here are the mean votes and comment counts (rounded to integer value) for the 1st through 3rd, then 5th, 10th, 15th, 20th, 25th, and 30th stories on the Hacker News front page:

    pastebin.com/raw/KvhYfBdB

    (Expires in a week.)

    For 2022, the last full year of data, by position:

    Pos:  Votes / Comments
    1: 1005 / 450
    2: 704 / 380
    3: 582 / 322
    5: 460 / 279
    10: 301 / 176
    15: 235 / 145
    20: 192 / 130
    25: 170 / 110
    30: 158 / 108

    That's a 6.36x advantage in votes, and 4.17x advantage in comments from the 1st to 30th story position.
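    The two multipliers follow directly from the first and last rows of the table. A quick check in Python, with the values copied from the 2022 figures above (the variable names are made up):

```python
# 2022 means by front-page position, from the table: (votes, comments).
means_2022 = {1: (1005, 450), 30: (158, 108)}

vote_advantage = means_2022[1][0] / means_2022[30][0]
comment_advantage = means_2022[1][1] / means_2022[30][1]

print(f"{vote_advantage:.2f}x votes, {comment_advantage:.2f}x comments")
```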

    #HackerNewsAnalytics #HackerNews #MediaAnalysis

  22. Hacker News Analytics: ~3% of submissions reach front page, with half of comments on FP articles

    This is a finding based on maths and a previous study by Whaly in 2022 based on HN 2021 activity, rather than my own crawl, though it's informed by the latter.

    whaly.io/posts/hacker-news-202

    The HN front page is a limited resource --- there are 365 * 30 == 10,950 front-page slots in a year, another 30, or 10,980, in a leap year, and regardless of site activity over a year, those slots are fixed. It's somewhat of a reminder that regardless of how much information we can access, our time to process that information is finite. Or as Herbert Simon observed: what information consumes is attention.

    Whaly saw 386,663 total story submissions for 2021. I'm pretty sure that this is net of moderation (user flags, auto-kills, spam detection, voting-ring detection and the like). But it works out to a hair under 3% of stories not catching on any of those tripwires which then land on the HN front page.

    Mind that that's actually a somewhat low estimate, as a story may appear for part of the day on the front page but not be represented on the end-of-day front-page archive.

    I'm now thinking of doing some spot checks to see what kinds of success rates individual submitters have in landing on the front page. From what I've seen, even well-known and popular members have at best a modest chance of success.

    Whaly also give a total number of comments: 3,769,520. That I can compare to my own front-page stats for 2021: 1,859,933, or 49.34% of all comments. That is, half of HN comments appear on the 3% of stories which reach the front page. That percentage is lower than what I'd have expected, though it's still a very strong bias toward the front page.
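
    The arithmetic behind both percentages can be sketched in a few lines of Python. The figures are the ones quoted in this post; the variable and function names are illustrative only.

```python
FRONT_PAGE_SLOTS_PER_DAY = 30  # stories on the end-of-day front-page archive


def front_page_slots(days_in_year=365):
    """Number of front-page slots available in a year."""
    return days_in_year * FRONT_PAGE_SLOTS_PER_DAY


# Figures quoted in the post for 2021.
submissions_2021 = 386_663            # Whaly: total story submissions
all_comments_2021 = 3_769_520         # Whaly: total comments
front_page_comments_2021 = 1_859_933  # comments on archived front-page stories

slots = front_page_slots(365)
front_page_share = slots / submissions_2021
comment_share = front_page_comments_2021 / all_comments_2021

print(f"{slots:,} slots, {front_page_share:.2%} of submissions")
print(f"{comment_share:.2%} of comments are on front-page stories")
```

    This treats every archived slot as a distinct submission, which is the same simplification the estimate in the post makes.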

    (Now I want to complete another analysis I'd thought of: mean votes and comments by story position (1--30), by year. Hrm...)

    #HackerNewsAnalytics #HackerNews #MediaAnalysis

  23. HN Front Page: Foreign Policy Top 100 Global Thinkers (2014)

    I pulled a copy of the "global thinkers" list I'd used as an indicator of website salience in a 2015 study.

    The HN front page offers a limited opportunity for matches --- titles are 80 characters only, and HN's editorial policy is to not list authors of works, so what will show here is likely a subset of actual mentions.

    That said: nearly a quarter of the list (23 entries) appear, from 1 to 11 times each. Paul Krugman (11), Lawrence Lessig (10), and Richard Dawkins (10) top the list.

    1 Paul Krugman: 11
    2 Lawrence Lessig: 10
    3 Richard Dawkins: 10
    4 Freeman Dyson: 9
    5 Daniel Kahneman: 8
    6 Noam Chomsky: 8
    7 Jaron Lanier: 6
    8 Steven Pinker: 5
    9 Daniel Dennett: 4
    10 Christopher Hitchens: 2
    11 Craig Venter: 2
    12 Edward O. Wilson: 2
    13 Jared Diamond: 2
    14 Richard Posner: 2
    15 Steven Weinberg: 2
    16 Thomas Friedman: 2
    17 Gary Becker: 1
    18 Hernando de Soto: 1
    19 James Lovelock: 1
    20 Larry Summers: 1
    21 Martha Nussbaum: 1
    22 Peter Singer: 1
    23 Salman Rushdie: 1

    The 2015 post, "Tracking the Conversation", is here: old.reddit.com/r/dredmorbius/c

    #HackerNews #HackerNewsAnalytics #MediaAnalysis #ForeignPolicy #Top100GlobalThinkers

  24. Hacker News Front Page Analytics ... what next?

    I'm thinking through where else to take this. I've had a few side discussions and commentary here and at HN. Part of this is coming up with questions, part with the tools to answer them.

    The initial question concerned places and regional references found on the HN front page. My initial analysis answered that (US states), and it was pretty easy to add cities (US and global) and countries to the list.

    I also wanted some overall summary statistics, for all time, by year, by period (I've done weekdays, I still need to get to months). There were some interesting comparisons --- vote and comment activity by page position, for example (there's an 844.4 point advantage for 1st over 30th place in votes, 340.3 in comments, for 2022, on average).

    I've broken out overall and average (per-story) votes and comments, which is interesting.

    There's top-site and top-user activity, and how that changes over time. I've done some work on this, I'm thinking of both other questions and how to represent this graphically.

    (Graphical representation is a question for other aspects as well ... what I've created so far is great for people who like reading 100s of pages of tables, less for those who prefer a visual representation.)

    What I've done less of, and am trying to think of ways to surface interesting elements rather than be strictly query/question driven, is to find patterns and trends in the data itself, most especially in the title text. There are challenges: HN doesn't provide much to work with (titles are restricted to 80 characters, generally), and there is of course ambiguity, though I'd posted a set of interesting/amusing items (see: toot.cat/@dredmorbius/11045412).

    I've been playing with some simple ngram code (awk associative arrays of 2..5 elements ... mind-bogglingly easy to create and often surprisingly insightful).
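
    For anyone who would rather not use awk, the same idea can be sketched in Python with a `Counter` standing in for the associative array. This is not the original code: the whitespace tokenisation, the lowercasing, and the function names are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(words, n):
    """Yield each run of n consecutive words as a tuple."""
    for i in range(len(words) - n + 1):
        yield tuple(words[i:i + n])


def count_ngrams(titles, min_n=2, max_n=5):
    """Count all n-grams of min_n..max_n words across a list of titles."""
    counts = Counter()
    for title in titles:
        # Assumption: split on whitespace and fold case; punctuation is kept.
        words = title.lower().split()
        for n in range(min_n, max_n + 1):
            counts.update(ngrams(words, n))
    return counts


# Three sample titles, used only to exercise the code.
sample_titles = ["GitHub is down", "Dropbox is down", "Heroku is down again"]
counts = count_ngrams(sample_titles)
print(counts[("is", "down")])
```

    Titles shorter than n words simply contribute no n-grams of that length, so the 80-character limit needs no special handling.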

    I've relied on some external lists of entities (states, cities, countries, etc.) which are useful. I'd done an earlier analysis based on the Foreign Policy Top 100 Global Thinkers list, assessing salience level of various online sources, in 2015 (see: old.reddit.com/r/dredmorbius/c). I can re-use that list, though I'd like to find a few others --- top startups / companies / people. Also perhaps major stories and terms from the past two decades. (I've done some searches based on my own recollection, e.g., MeToo, BLM, George Floyd, and the like with some success).

    And I'd like to do a deeper parse of the source HTML to grab both HN threads and source URLs. I've found the html-xml-utils package useful, need to check that's installed locally (OK, seems it is) and wrap my head around it again (the tools are ... idiosyncratic). Oh, and homebrew lists package executables in /usr/local/opt//bin/, which is good to know. Yay!

    (Yes, I'm aware there are other tools. I'm a simple basher.)

    #HackerNews #HackerNewsAnalytics

  25. Hacker News "Leaders" front-page activity

    So, more on that thing I said I wouldn't do but did anyway ...

    Backstory: an offhand question led me to crawl the HN front-page (FP) archive from 2007-present, just shy of 6,000 pages, representing 178,162 stories, 52,400 distinct sites, and 43,491 distinct submitters. Each page has up to 30 stories, such that a fully-populated year has 10,950 or 10,980 (leap year) stories.

    HN also provides a "leaders" page showing the top-100 members and "karma" (overall votes) --- the latter being obscured for the top-10 members, though that can be found on their profile page. (news.ycombinator.com/leaders)

    So ... I can get a summary of front-page activity for all leaders. It's ... interesting.

    To assuage my guilt somewhat I'm only reporting overall / summary or anonymised stats. My goal isn't to out anyone specifically, but to give a sense of what HN front-page and "leader" member activity is like.

    Seven leaders have no front-page posts at all, 17 have single-digit counts. The range is from 0 (obv!) to 1,183, mean 175.7, median 129, st.dev. 201.32, 10%ile: 3, 25%ile: 11, 75%ile: 253.5, 90%ile: 493.5.

    Active years (years in which there is nonzero front-page activity) is ... all over the map -- there are members with results over 17 years, and with none at all.

    What's ... peculiar ... is the points/karma% ratios. "Points" are votes on stories, "karma" is supposedly overall points (sum of story + comment moderation, less some for negative votes). The percentage of votes to overall karma ranges from 0 (no front-page activity) ... to 150.94%: more votes than cumulative karma. Points > overall karma (ratio > 100%) happens sixteen times, which is ... odd.

    (Well, I mean, 16 is an even number, but the fact is odd-as-in-strange.)

    One reason I've been doing this is to come up with some sense of overall quality metric. Engagements (votes and comments) are a highly-imperfect indicator, but looking at the arithmetic mean of votes and comments is interesting. I'm looking here at the average over all a member's front-page submissions:

    Votes range from 0 to 634, mean 196.50, median 105.91, st.dev. 101.92, 25%ile: 150, 75%ile: 239.95.

    Comments range from 0 to 323.75, mean 102.06, median: 96.38, 25%ile: 60.67, 75%ile: 123.16.

    As might be expected, several members with lower-than-average submission counts see high averages (there's more variance in small-n stats). One of the top-10 submitters (by average points and comments) has 514 FP stories, with an average of 236.37 points and 176.96 comments, and the most prolific submitter is very nearly median by votes and comments.

    It's also possible to look at who's submitting a small or large range of sites by calculating a sites/stories% ratio. I'm finding, for example, one leader with 414 FP stories, from only 30 distinct sites, with the top site representing over half their submissions. (The site in question is legit and interesting, this does not appear to be spammy.) Several appear to favour their own personal sites / blogs, though again, not in a noxious way that I see. And 18 leaders have posted only a single item per site (each post is its own site), ranging from 1 to 20 FP items overall.

    The ratio ranges from 0 (obv!) to 100 (obv!), mean 67.03%, median 71.83%, 25%ile: 51.82%, 75%ile: 89.72%.
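
    To make the sites/stories% ratio concrete, here is a minimal Python sketch. It assumes the ratio is distinct sites divided by front-page stories, expressed as a percentage, with 0 for a member who has no stories. The function name and the `.example` domains are hypothetical.

```python
def sites_per_story_pct(sites):
    """Distinct sites as a percentage of front-page stories (0 if no stories).

    `sites` holds one site name per front-page story for a single member.
    """
    if not sites:
        return 0.0
    return 100 * len(set(sites)) / len(sites)


# Hypothetical members, for illustration only.
one_site_per_post = ["a.example", "b.example", "c.example"]
mostly_one_site = ["a.example", "a.example", "a.example", "b.example"]

print(sites_per_story_pct(one_site_per_post))  # every post is its own site
print(sites_per_story_pct(mostly_one_site))

# The leader mentioned in the post: 30 distinct sites over 414 FP stories.
print(f"{100 * 30 / 414:.2f}%")
```

    On that definition, the 414-story leader sits near the very bottom of the range, far below the 25th percentile.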

    Edit: Words.

    #HackerNewsAnalytics #HackerNews #MediaAnalysis

  26. @tsturm That's already been the trend by year, roughly:

    2008: 2
    2009: 2
    2011: 5
    2012: 4
    2013: 10
    2014: 6
    2015: 19
    2016: 31
    2017: 30
    2018: 26
    2019: 28
    2020: 41
    2021: 37
    2022: 27
    2023: 14

    Or if you like barplots:

    2008: *
    2009: *
    2011: **
    2012: **
    2013: *****
    2014: ***
    2015: **********
    2016: ****************
    2017: ****************
    2018: **************
    2019: ***************
    2020: **********************
    2021: ********************
    2022: **************
    2023: *******

    2023 is tracking for 35 entries based on 39% of year complete (as of 25 May 2023, the data cut-off date).
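
    The projection is a simple linear extrapolation, sketched here in Python. It assumes entries arrive at a constant rate through the year, which is a simplification; the function name is illustrative.

```python
from datetime import date


def projected_full_year(count_so_far, cutoff):
    """Linearly extrapolate a year-to-date count to a full-year figure.

    Returns (projected count, fraction of the year elapsed at `cutoff`).
    """
    day_of_year = cutoff.timetuple().tm_yday
    days_in_year = date(cutoff.year, 12, 31).timetuple().tm_yday
    fraction_complete = day_of_year / days_in_year
    return count_so_far / fraction_complete, fraction_complete


# 14 entries as of the 25 May 2023 data cut-off.
projection, fraction = projected_full_year(14, date(2023, 5, 25))
print(f"{fraction:.1%} of year complete, projecting about {projection:.1f} entries")
```

    With only 14 entries so far, the projection is quite sensitive to one or two extra stories, so it is best read as a rough guide.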

    Noticeable bump in 2020 as well.

    #HackerNews #HackerNewsAnalytics #MediaAnalysis

  27. Things about which Hacker News cares being down, and of which it has noticed:

    Skype network is down, possibly under viral DoS attack. Lessons?
    Is this why Twitter is down? Their Engineer Speaks
    Amazon is down ... implications for AWS?
    The Website Is Down (Hilarious 10 Minute Video)
    Matthew Simmons: The only way is down
    GitHub is down
    KK on Unabomber: pounce on [technology] when it is down and kill it before it rises again
    Yes, Rackspace Is Down And So Are Many Of Your Favorite Sites
    Tell HN: Authorize.net is down
    Dreamhost is down. All of it.
    Most of Slicehost is Down
    Ubisoft DRM authentification server is down, Assassin's Creed 2 unplayable
    Dropbox is down
    Heroku is down for the third time today
    Tumblr is Down – Fans Angry
    Great. Skype is down.
    Reddit Is Down To One Developer
    AWS is down, but here's why the sky is falling
    Amazon EC2 EU-West is down
    Reddit is down for 12 hours protest SOPA and PIPA.
    Java.sun.com is down again - breaking bad apps across the land
    Heroku is down
    Tell HN: Heroku is Down (update: recovering as of 10PM PST)
    AWS is down due to an electrical storm in the US
    Heroku is down again
    Google Talk is down
    GoDaddy's DNS Service is Down
    Github is down
    Netflix is Down
    Hacker News is down, so we made five issues free
    This site is down because the owner stiffed the web designer
    Dropbox is down
    WhatsApp is down
    DreamObjects is down
    Facebook is down (09:08AM PDT Aug 1, 2014)
    YTMND is down for temporary maintenance
    Google Cloud Is Down
    GitHub is down
    DigitalOcean block storage is down
    Firefox usage is down despite Mozilla's top exec pay going up
    Slack is down
    [dupe] Slack is down
    Tell HN: GitHub is down again
    Kiwi Farms is down across all domains as DDoS-Guard terminates service
    Twitter's API is down?

    #HackerNews #HackerNewsAnalytics #MediaAnalysis

  28. According to the Hacker News front page, there are ...:

    • 313 things that suck.
    • 18 things that will fail.
    • 116 things that rock.
    • 157 things that are awesome.
    • 0 things that are bollocks.
    • 685 things that are great.
    • 75 things that are terrible.
    • 1 thing that is both terrible and amazing. And it is you.
    • 28 things that are horrible.
    • 22 things that are a list of some number of things.
    • 33 things that are a list of some number of reasons.
    • 0 hot takes.
    • 3,101 things that are how to's.
    • 6,434 things that are "hows" but not how to's.
    • 98 things that are how not to's.
    • 21 things that are silly.
    • 86 things that are clever.
    • 318 things that are smart, none of which are phones.
    • 58 things that are brilliant.
    • 147 things that are stupid.
    • 20 things that are terrifying.
    • 19 things that you must do.

    Edit: Hashtag surgery (whitespace in hashtags is a thing that sucks).

    #HackerNews #HackerNewsAnalytics #TooMuchFunWithGrep #Suck #Fail #Rock #Awesome #Bollocks

  29. One of the challenges of having an Eminently Queryable Data Trove is ... deciding what to query it about.

    I've long thought that HN was fairly obsessed with various aspects of the hiring process, from both employer and worker perspectives.

    Let's check that ...

    $ egrep -i '(interview|hiring|recruiting)' <(grep '^  Title:' parse.log ) | wc -l
    1282

    Ayup.

    That's 1,282 stories out of 178,072, or just over 0.7%, but still a healthy chunk. By contrast, "housing" gets 90 hits, "Tesla" 413, and "Musk", 114.
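
    The same count can be sketched in Python for anyone without a shell to hand. The only thing taken from the real data is that title lines in parse.log begin with two spaces and "Title:", as in the grep pattern. The sample lines and the "Site:" field are made up for illustration.

```python
import re

# Same alternation as the egrep pattern, matched case-insensitively.
PATTERN = re.compile(r"(interview|hiring|recruiting)", re.IGNORECASE)


def count_matching_titles(lines, pattern=PATTERN, prefix="  Title:"):
    """Count lines that start with `prefix` and match `pattern` anywhere."""
    return sum(
        1 for line in lines if line.startswith(prefix) and pattern.search(line)
    )


# Hypothetical log lines, not taken from the real parse.log.
sample_lines = [
    "  Title: Notes on a technical interview",
    "  Title: Dropbox is down",
    "  Title: Hiring is broken",
    "  Site: hiring.example",  # not a title line, so it is ignored
]

matches = count_matching_titles(sample_lines)
print(matches)

# Share of front-page stories, using the counts reported in the post.
print(f"{1282 / 178072:.2%}")
```

    Like the shell version, this matches substrings, so a word such as "interviewer" counts as a hit.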

    Or the FAANG+M set:

    Facebook: 2,414
    Apple: 2,495
    Amazon: 1,467
    Netflix: 326
    Google: 5,900
    Microsoft: 1,523

    I'm still trying to sort out a way to search / determine "statistically interesting terms", that is words or phrases which are disproportionately represented in submission titles.

    #HackerNewsAnalytics #hiring #interviews #recruiting