home.social

#datasets — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #datasets, aggregated by home.social.

  1. UC San Diego: From Molecules to Meaning: A Search Engine for the Chemistry of Life. “An international team led by researchers at University of California San Diego and University of California, Riverside has developed a free, web-based platform designed to make public metabolomics data more accessible.”

    https://rbfirehose.com/2026/05/14/from-molecules-to-meaning-a-search-engine-for-the-chemistry-of-life-uc-san-diego/
  6. The Guardian: ‘Things were going dark left and right’: the race to save US government datasets before they’re deleted. “André is part of a group of ‘data rescuers’ who have banded together during Trump’s second term. They have been quietly racing to save hundreds of critical government datasets before they are no longer available. Now known as the Data Rescue Project, it’s a […]

    https://rbfirehose.com/2026/05/09/things-were-going-dark-left-and-right-the-race-to-save-us-government-datasets-before-theyre-deleted-the-guardian/
  11. "Ironically, several of the people who had been included in the set without any consent are known for their work critiquing surveillance and facial recognition itself, including filmmaker Laura Poitras, digital rights activist Jillian York, critic Evgeny Morozov, and author of Surveillance Capitalism Shoshana Zuboff."

    (re Microsoft's MS-CELEB)

    excavating.ai

    #AI #Surveillance #Datasets #ImageNet #Microsoft #MS-CELEB #KateCrawford

  12. University of Edinburgh: AI fails to make inroads with cybercriminals. “Cybercriminals have been struggling to adopt AI in their work, reports the first of its kind study that analysed a dataset of 100 million posts from underground cybercrime communities.”

    https://rbfirehose.com/2026/05/05/university-of-edinburgh-ai-fails-to-make-inroads-with-cybercriminals/
  13. However good your intentions, what you might consider an “ethical and responsible” use amounts to endorsing and legitimizing the systematic violation of rights that underpins the entire commercial generative AI industry.

    📌 No commercial generative AI model works without VIOLATING copyright.

    #IA #IAgenerativa #AI #genAI #generativeAI #datasets #theft #technology #ethics

  14. Arizona State University: Largest genomic dataset of Indigenous Americans to date sheds light on history, diversity and health. “In a new study published today in Nature, an international team led by the Institute of Evolutionary Biology, with partners at the University of São Paulo and Arizona State University, analyzed genomes from Indigenous populations spanning North America to Patagonia. […]

    https://rbfirehose.com/2026/04/27/arizona-state-university-largest-genomic-dataset-of-indigenous-americans-to-date-sheds-light-on-history-diversity-and-health/
  15. Max-Planck-Gesellschaft: Largest open dataset of great ape cognition. “A new publication introduces the EVApeCognition Dataset, a major open-access resource designed to advance research into the cognition of great apes. Compiling 262 experimental datasets from 150 scientific publications, the dataset was produced at the Wolfgang Köhler Primate Research Center in Leipzig, Germany, between 2004 […]

    https://rbfirehose.com/2026/04/24/max-planck-gesellschaft-largest-open-dataset-of-great-ape-cognition/
  16. USGS: Land Treatment Digital Library Version 2.0 Launch. “The U.S. Geological Survey launched an updated version (2.0) of the LTDL to improve user experience, include additional data, and enhance BLM access. Notable additions to the website include interactive figures for each treatment polygon that display the monthly average temperature and precipitation from PRISM Climate Group at Oregon […]

    https://rbfirehose.com/2026/04/07/usgs-land-treatment-digital-library-version-2-0-launch/
  17. New paper from us: "A dataset of insect sounds from 459 species for bioacoustic machine learning", published in Scientific Data, led by Marius Faiß doi.org/10.1038/s41597-026-071 #bioacoustics #datasets

  18. arXiv: GoogleTrendArchive: A Year-Long Archive of Real-Time Web Search Trends Worldwide. “Unlike Google Trends, which requires specifying search terms in advance, Trending Now captures search queries experiencing real-time surges, offering a way to inductively discover trending patterns across regions for studying collective attention dynamics. However, Google does not provide historical access […]

    https://rbfirehose.com/2026/03/25/googletrendarchive-a-year-long-archive-of-real-time-web-search-trends-worldwide-arxiv/
  21. ‘Thousands of authors including Kazuo Ishiguro, Philippa Gregory and Richard Osman have published an “empty” #book to protest against #AI firms using their work without permission.’

    theguardian.com/technology/202

    Yet #copyright itself has long been criticised as part of broader systems of enclosure and #SettlerViolence. So the assertion of copyright is not a victimless crime, any more than is the training of AI #chatbots and image generators on vast #datasets (often scraped without permission from the open web, digital repositories and shadow or #pirate libraries containing copyrighted books).

    So what exactly is being defended here? Do the authors protesting against the training of AI really not know the long history of critique of copyright? Or do they know perfectly and are just too selfish and are profiting too much from it themselves to want to challenge it or think of something different?

    #DefundCulture

  22. Bluesky’s Solution To Moderating Is Moderating Without Moderating via Social Proximity

    I have noticed a lot of people are confused about why some posts don’t show up on threads, though they are not labeled by the moderation layer. Bluesky has begun using what it calls social neighborhoods (or network proximity) as a ranking signal for replies in threads. Replies from people who are closer to you in the social graph, accounts you follow, interact with, or share mutual connections with, are prioritized and shown more prominently. Replies from accounts that are farther away in that network are down-ranked. They are pushed far down the thread or placed behind “hidden replies.”

    Each person gets their own unique view of a thread based on their social graph. It creates the impression that replies from distant users simply don’t exist. This is true even though they’re still technically public and viewable if you expand the thread or adjust filters. Bluesky is explicitly using features of subgraphs to moderate without moderating. Their reasoning is that if you can’t see each other, you can’t harass each other. Ergo, there is nothing to moderate.

    Bluesky mentions that here:

    https://bsky.social/about/blog/10-31-2025-building-healthier-social-media-update

    As a digression, I’m not going to lie: I really enjoyed working on software built on the AT protocol, but their fucking users are so goddamn weird. It’s sort of like enjoying building houses, but hating every single person who moves into them. But, you don’t have to deal with them because you’re just the contractor. That is how I feel about Bluesky. I hate the people. I really like the protocol and infrastructure.

    I sort of am a sadist who does enjoy drama, so I do get schadenfreude from people with social media addictions and parasocial fixations who reply to random people on Bluesky, because they don’t realize their replies are disconnected from the author’s thread unless that person is within their network. They aren’t part of the conversation they think they are. They’re algorithmically isolated from everyone else. Their replies aren’t viewable from the author’s thread because of how Bluesky handles social neighborhoods.

    Bluesky’s idea of social neighborhoods is about grouping users into overlapping clusters based on real interaction patterns rather than just the follow graph. Unlike Twitter, it does not treat the network as one big public square. Instead, it models networks of “social neighborhoods” made up of people you follow, people who follow you, people you frequently interact with, and people who are closely connected to those groups. They’re soft, probabilistic groupings rather than strict labels.

    Not everyone sees the same replies. Bluesky is being a bit vague with “hidden.” Hidden means your reply is still anchored to the thread and can be expanded. But there is a second way Bluesky can handle this. Bluesky uses social neighborhoods to judge contextual relevance. Replies from people inside or near your social neighborhood are more likely to be shown inline with a thread, expanded by default, or served in feeds. Replies from outside your neighborhood are still public and still indexed, but they’re treated as lower-context contributions.
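The proximity idea described above can be illustrated with a minimal sketch. This is not Bluesky's algorithm, whose details the post does not give; it only assumes that "closeness" is hop distance in a symmetric connection graph. The names `hop_distance`, `split_replies`, `connections`, `max_hops`, and `inline_hops` are all hypothetical.

```python
from collections import deque


def hop_distance(connections, viewer, author, max_hops=3):
    """Breadth-first search for the number of hops from viewer to author.

    `connections` maps each account to the set of accounts it is linked to
    (assumed symmetric here). Returns None if the author is unreachable
    within `max_hops`.
    """
    if viewer == author:
        return 0
    seen = {viewer}
    queue = deque([(viewer, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not explore beyond the hop limit
        for neighbor in connections.get(node, ()):
            if neighbor == author:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def split_replies(connections, viewer, replies, inline_hops=2):
    """Partition (author, text) replies into inline and hidden lists.

    Replies from authors within `inline_hops` are shown inline, closest
    first; everything else is kept but placed in the hidden list.
    """
    inline, hidden = [], []
    for author, text in replies:
        d = hop_distance(connections, viewer, author)
        if d is not None and d <= inline_hops:
            inline.append((d, author, text))
        else:
            hidden.append((author, text))
    inline.sort(key=lambda item: item[0])  # stable: ties keep arrival order
    return [(a, t) for _, a, t in inline], hidden


# Hypothetical toy network: viewer - alice - bob - carol - dave
connections = {
    "viewer": {"alice"},
    "alice": {"viewer", "bob"},
    "bob": {"alice", "carol"},
    "carol": {"bob", "dave"},
    "dave": {"carol"},
}
replies = [("dave", "d"), ("bob", "b"), ("alice", "a"), ("stranger", "s")]
inline, hidden = split_replies(connections, "viewer", replies)
```

Because the split depends on the viewer, two accounts looking at the same thread get different inline lists, which matches the post's point that each person sees their own version of a thread. A production system would presumably use weighted interaction signals rather than raw hop counts.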

    Basically, if you reply to a thread, you will see it anchored to the conversation, and everyone will see it in search results, as a hashtag, or from your profile, but it will not be accessible via the thread of the person you were replying to. It is like shadow-banning people from threads unless they are strongly networked.

    Because people have not been working with the AT Protocol like I have, they assume they are shadow-banned across the entire Bluesky app view. No—everyone is automatically shadow-banned from everyone else unless they are within the same social neighborhood. In other words, you are not part of the conversation you think you are joining because you are not part of their social group.

    Your replies will appear in profiles, hashtag feeds, or search results without being visually anchored to the full thread. Discovery impressions are neighborhood-agnostic: they serve content because it matches a query, tag, or activity stream. Once the reply is shown, the app then decides whether it’s worth pulling in the rest of the conversation for you. If the original author and most participants fall outside your neighborhood, Bluesky often chooses not to expand that context automatically.

    Bluesky really is trying to avoid having to moderate, so this is their solution. Instead of banning or issuing takedown labels to DIDs, the system lets replies exist everywhere, but not in that particular instance of the thread.

    I find this ironic because a large reason why many people are staying on Bluesky and not moving to the fediverse—thank God, because I do not want them there—is discoverability, virality, and engagement.

    In case anyone is asking how I know so much about how these algorithms work: I was a consultant on a lot of these types of algorithms, so I certainly hope I’d know how they work, lol. No, you get no more details about the work I’ve done. I have no hand in the algorithm Bluesky is using, but I have proposed and implemented that type of algorithm before.

    I have an interest in noetics and the noosphere. A large amount of my ontological work is an extension of my attempts to model domains that have no spatial or temporal coordinates. The question is how do you generalize a metric space that has no physical, spatial properties? I went to school to try to formalize those ideas. Turns out they’re rather useful for digital social networks, too. The ontological analog to spatial distance, when you have no space, is a graph of similarities.

    This can be modeled by representing each item as a node in a weighted graph, where edges are weighted by dissimilarity rather than similarity. Highly similar items are connected by low-weight edges, while less similar items are connected by higher-weight edges. Distances in the graph, computed using standard shortest-path algorithms, then correspond to degrees of similarity. Closely related items are separated by short path lengths, while increasingly dissimilar items require longer paths through the graph. It turns out that attempts to generalize metric spaces for noetic domains—to model noetic/psychic spaces—are actually pretty useful for social media algorithms, lol.
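The dissimilarity-weighted graph described above can be sketched in a few lines. This is an illustration under assumptions, not the author's implementation: the post only says "standard shortest-path algorithms", so Dijkstra's algorithm is used here as one such choice, and the item names and edge weights are made up for the example.

```python
import heapq


def shortest_distances(graph, source):
    """Dijkstra's algorithm over a graph with non-negative edge weights.

    `graph` maps each node to a dict of {neighbor: dissimilarity}.
    Returns the shortest-path distance from `source` to every reachable node.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


# Hypothetical items: low weight = very similar, high weight = dissimilar.
graph = {
    "jazz": {"blues": 0.25, "opera": 2.0},
    "blues": {"jazz": 0.25, "rock": 0.5},
    "rock": {"blues": 0.5, "opera": 1.0},
    "opera": {"jazz": 2.0, "rock": 1.0},
}
distances = shortest_distances(graph, "jazz")
```

In this toy graph the direct jazz-opera edge costs 2.0, but the path through blues and rock costs 1.75, so the graph distance is the smaller value. That is the property the post relies on: distance is defined by the best chain of similar intermediates, not only by direct comparison.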

  23. BlueSky’s Solution To Moderating Is Moderating Without Moderating via Social Proximity

    I have noticed a lot of people are confused about why some posts don’t show up on threads, though they are not labeled by the moderation layer. Bluesky has begun using what it calls social neighborhoods (or network proximity) as a ranking signal for replies in threads. Replies from people who are closer to you in the social graph, accounts you follow, interact with, or share mutual connections with, are prioritized and shown more prominently. Replies from accounts that are farther away in that network are down-ranked. They are pushed far down the thread or placed behind “hidden replies.”

    Each person gets their own unique view of a thread based on their social graph. It creates the impression that replies from distant users simply don’t exist. This is true even though they’re still technically public and viewable if you expand the thread or adjust filters. Bluesky is explicitly using features of subgraphs to moderate without moderating. Their reasoning is that if you can’t see each other, you can’t harass each other. Ergo, there is nothing to moderate.

    Bluesky mentions that here:

    https://bsky.social/about/blog/10-31-2025-building-healthier-social-media-update

    As a digression, I’m not going to lie: I really enjoyed working on software built on the AT protocol, but their fucking users are so goddamn weird. It’s sort of like enjoying building houses, but hating every single person who moves into them. But, you don’t have to deal with them because you’re just the contractor. That is how I feel about Bluesky. I hate the people. I really like the protocol and infrastructure.

    I sort of am a sadist who does enjoy drama, so I do get schadenfreude from people with social media addictions and parasocial fixations who reply to random people on Bluesky, because they don’t realize their replies are disconnected from the author’s thread unless that person is within their network. They aren’t part of the conversation they think they are. They’re algorithmically isolated from everyone else. Their replies aren’t viewable from the author’s thread because of how Bluesky handles social neighborhoods.

    Bluesky’s idea of social neighborhoods is about grouping users into overlapping clusters based on real interaction patterns rather than just the follow graph. Unlike Twitter, it does not treat the network as one big public square. Instead, it models networks of “social neighborhoods” made up of people you follow, people who follow you, people you frequently interact with, and people who are closely connected to those groups. They’re soft, probabilistic groupings rather than strict labels.

    Everyone does not see the same replies. Bluesky is being a bit vague with “hidden.” Hidden means your reply is still anchored to the thread and can be expanded. There is another way Bluesky can handle this. Bluesky uses social neighborhoods to judge contextual relevance. Replies from people inside or near your social neighborhood are more likely to be shown inline with a thread, expanded by default, or served in feeds. Replies from outside your neighborhood are still public and still indexed, but they’re treated as lower-context contributions.

    Basically, if you reply to a thread, you will see it anchored to the conversation, and everyone will see it in search results, as a hashtag, or from your profile, but it will not be accessible via the thread of the person you were replying to. It is like shadow-banning people from threads unless they are strongly networked.

    Because people have not been working with the AT Protocol like I have, they assume they are shadow-banned across the entire Bluesky app view. No—everyone is automatically shadow-banned from everyone else unless they are within the same social neighborhood. In other words, you are not part of the conversation you think you are joining because you are not part of their social group.

    Your replies will appear in profiles, hashtag feeds, or search results without being visually anchored to the full thread. Discovery impressions are neighborhood-agnostic: they serve content because it matches a query, tag, or activity stream. Once the reply is shown, the app then decides whether it’s worth pulling in the rest of the conversation for you. If the original author and most participants fall outside your neighborhood, Bluesky often chooses not to expand that context automatically.

    Bluesky really is trying to avoid having to moderate, so this is their solution. Instead of banning or issuing takedown labels to DIDs, the system lets replies exist everywhere, but not in that particular instance of the thread.

    I find this ironic because a large reason why many people are staying on Bluesky and not moving to the fediverse—thank God, because I do not want them there—is discoverability, virality, and engagement.

    In case anyone is asking how I know so much about how these algorithms work: I was a consultant on a lot of these types of algorithms, so I certainly hope I’d know how they work, lol. No, you get no more details about the work I’ve done. I have no hand in the algorithm Bluesky is using, but I have proposed and implemented that type of algorithm before.

    I have an interest in noetics and the noosphere. A large amount of my ontological work is an extension of my attempts to model domains that have no spatial or temporal coordinates. The question is how do you generalize a metric space that has no physically, spatial properties. I went to school to try to formalize those ideas. Turns out they’re rather useful for digital social networks, too. The ontological analog to spatial distance, when you have no space, is a graph of similarities.

  24. BlueSky’s Solution To Moderating Is Moderating Without Moderating via Social Proximity

    I have noticed a lot of people are confused about why some posts don’t show up on threads, though they are not labeled by the moderation layer. Bluesky has begun using what it calls social neighborhoods (or network proximity) as a ranking signal for replies in threads. Replies from people who are closer to you in the social graph, accounts you follow, interact with, or share mutual connections with, are prioritized and shown more prominently. Replies from accounts that are farther away in that network are down-ranked. They are pushed far down the thread or placed behind “hidden replies.”

    Each person gets their own unique view of a thread based on their social graph. It creates the impression that replies from distant users simply don’t exist. This is true even though they’re still technically public and viewable if you expand the thread or adjust filters. Bluesky is explicitly using features of subgraphs to moderate without moderating. Their reasoning is that if you can’t see each other, you can’t harass each other. Ergo, there is nothing to moderate.

    Bluesky mentions that here:

    https://bsky.social/about/blog/10-31-2025-building-healthier-social-media-update

    As a digression, I’m not going to lie: I really enjoyed working on software built on the AT protocol, but their fucking users are so goddamn weird. It’s sort of like enjoying building houses, but hating every single person who moves into them. But, you don’t have to deal with them because you’re just the contractor. That is how I feel about Bluesky. I hate the people. I really like the protocol and infrastructure.

    I sort of am a sadist who does enjoy drama, so I do get schadenfreude from people with social media addictions and parasocial fixations who reply to random people on Bluesky, because they don’t realize their replies are disconnected from the author’s thread unless that person is within their network. They aren’t part of the conversation they think they are. They’re algorithmically isolated from everyone else. Their replies aren’t viewable from the author’s thread because of how Bluesky handles social neighborhoods.

    Bluesky’s idea of social neighborhoods is about grouping users into overlapping clusters based on real interaction patterns rather than just the follow graph. Unlike Twitter, it does not treat the network as one big public square. Instead, it models networks of “social neighborhoods” made up of people you follow, people who follow you, people you frequently interact with, and people who are closely connected to those groups. They’re soft, probabilistic groupings rather than strict labels.

    Everyone does not see the same replies. Bluesky is being a bit vague with “hidden.” Hidden means your reply is still anchored to the thread and can be expanded. There is another way Bluesky can handle this. Bluesky uses social neighborhoods to judge contextual relevance. Replies from people inside or near your social neighborhood are more likely to be shown inline with a thread, expanded by default, or served in feeds. Replies from outside your neighborhood are still public and still indexed, but they’re treated as lower-context contributions.

    Basically, if you reply to a thread, you will see it anchored to the conversation, and everyone will see it in search results, as a hashtag, or from your profile, but it will not be accessible via the thread of the person you were replying to. It is like shadow-banning people from threads unless they are strongly networked.

    Because people have not been working with the AT Protocol like I have, they assume they are shadow-banned across the entire Bluesky app view. No—everyone is automatically shadow-banned from everyone else unless they are within the same social neighborhood. In other words, you are not part of the conversation you think you are joining because you are not part of their social group.

    Your replies will appear in profiles, hashtag feeds, or search results without being visually anchored to the full thread. Discovery impressions are neighborhood-agnostic: they serve content because it matches a query, tag, or activity stream. Once the reply is shown, the app then decides whether it’s worth pulling in the rest of the conversation for you. If the original author and most participants fall outside your neighborhood, Bluesky often chooses not to expand that context automatically.

    Bluesky really is trying to avoid having to moderate, so this is their solution. Instead of banning or issuing takedown labels to DIDs, the system lets replies exist everywhere, but not in that particular instance of the thread.

    I find this ironic because a large reason why many people are staying on Bluesky and not moving to the fediverse—thank God, because I do not want them there—is discoverability, virality, and engagement.

    In case anyone is asking how I know so much about how these algorithms work: I was a consultant on a lot of these types of algorithms, so I certainly hope I’d know how they work, lol. No, you get no more details about the work I’ve done. I have no hand in the algorithm Bluesky is using, but I have proposed and implemented that type of algorithm before.

    I have an interest in noetics and the noosphere. A large amount of my ontological work is an extension of my attempts to model domains that have no spatial or temporal coordinates. The question is how do you generalize a metric space that has no physically, spatial properties. I went to school to try to formalize those ideas. Turns out they’re rather useful for digital social networks, too. The ontological analog to spatial distance, when you have no space, is a graph of similarities.

    This can be modeled by representing each item as a node in a weighted graph, where edges are weighted by dissimilarity rather than similarity. Highly similar items are connected by low-weight edges, while less similar items are connected by higher-weight edges. Distances in the graph, computed using standard shortest-path algorithms, then correspond to degrees of similarity. Closely related items are separated by short path lengths, while increasingly dissimilar items require longer paths through the graph. It turns out that attempts to generalize metric spaces for noetic domains—to model noetic/psychic spaces—are actually pretty useful for social media algorithms, lol.

  25. BlueSky’s Solution To Moderating Is Moderating Without Moderating via Social Proximity

    I have noticed a lot of people are confused about why some posts don’t show up on threads, though they are not labeled by the moderation layer. Bluesky has begun using what it calls social neighborhoods (or network proximity) as a ranking signal for replies in threads. Replies from people who are closer to you in the social graph, accounts you follow, interact with, or share mutual connections with, are prioritized and shown more prominently. Replies from accounts that are farther away in that network are down-ranked. They are pushed far down the thread or placed behind “hidden replies.”

    Each person gets their own unique view of a thread based on their social graph. It creates the impression that replies from distant users simply don’t exist. This is true even though they’re still technically public and viewable if you expand the thread or adjust filters. Bluesky is explicitly using features of subgraphs to moderate without moderating. Their reasoning is that if you can’t see each other, you can’t harass each other. Ergo, there is nothing to moderate.

    Bluesky mentions that here:

    https://bsky.social/about/blog/10-31-2025-building-healthier-social-media-update

    As a digression, I’m not going to lie: I really enjoyed working on software built on the AT protocol, but their fucking users are so goddamn weird. It’s sort of like enjoying building houses, but hating every single person who moves into them. But, you don’t have to deal with them because you’re just the contractor. That is how I feel about Bluesky. I hate the people. I really like the protocol and infrastructure.

    I sort of am a sadist who does enjoy drama, so I do get schadenfreude from people with social media addictions and parasocial fixations who reply to random people on Bluesky, because they don’t realize their replies are disconnected from the author’s thread unless that person is within their network. They aren’t part of the conversation they think they are. They’re algorithmically isolated from everyone else. Their replies aren’t viewable from the author’s thread because of how Bluesky handles social neighborhoods.

    Bluesky’s idea of social neighborhoods is about grouping users into overlapping clusters based on real interaction patterns rather than just the follow graph. Unlike Twitter, it does not treat the network as one big public square. Instead, it models networks of “social neighborhoods” made up of people you follow, people who follow you, people you frequently interact with, and people who are closely connected to those groups. They’re soft, probabilistic groupings rather than strict labels.

    Not everyone sees the same replies. Bluesky is being a bit vague with “hidden.” Hidden means your reply is still anchored to the thread and can be expanded. But there is a second mechanism: Bluesky uses social neighborhoods to judge contextual relevance. Replies from people inside or near your social neighborhood are more likely to be shown inline with a thread, expanded by default, or served in feeds. Replies from outside your neighborhood are still public and still indexed, but they’re treated as lower-context contributions.
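
A rough sketch of the viewer-dependent behavior described above. This is not Bluesky's actual code or API. The function names, the `follows` and `interactions` dictionaries, and the one-hop expansion are all hypothetical, and the hard set membership used here is a simplification of what the post describes as soft, probabilistic groupings.

```python
def neighborhood(user, follows, interactions):
    """Hypothetical neighborhood: who you follow, who follows you,
    and who you frequently interact with."""
    following = follows.get(user, set())
    followers = {u for u, targets in follows.items() if user in targets}
    frequent = interactions.get(user, set())
    return following | followers | frequent


def visible_inline(viewer, reply_author, follows, interactions):
    """Return True if a reply would be shown inline for this viewer.

    Assumption: a reply is inline when its author is the viewer, is in
    the viewer's neighborhood, or is one hop beyond it. Everything else
    stays public but is not expanded in the thread.
    """
    near = neighborhood(viewer, follows, interactions)
    extended = set(near)
    for member in near:
        # "People who are closely connected to those groups"
        extended |= neighborhood(member, follows, interactions)
    return reply_author == viewer or reply_author in extended


# Made-up accounts: ana follows ben, ben follows cy, zed is unconnected.
follows = {"ana": {"ben"}, "ben": {"cy"}, "zed": set()}
interactions = {}

for author in ("ben", "cy", "zed"):
    print(author, visible_inline("ana", author, follows, interactions))
```

The same reply gets a different answer depending on who is viewing, which is the point the post makes: two people can open the same thread and see different sets of replies.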

    Basically, if you reply to a thread, you will see it anchored to the conversation, and everyone will see it in search results, as a hashtag, or from your profile, but it will not be accessible via the thread of the person you were replying to. It is like shadow-banning people from threads unless they are strongly networked.

    Because people have not been working with the AT Protocol like I have, they assume they are shadow-banned across the entire Bluesky app view. No—everyone is automatically shadow-banned from everyone else unless they are within the same social neighborhood. In other words, you are not part of the conversation you think you are joining because you are not part of their social group.

    Your replies will appear in profiles, hashtag feeds, or search results without being visually anchored to the full thread. Discovery impressions are neighborhood-agnostic: they serve content because it matches a query, tag, or activity stream. Once the reply is shown, the app then decides whether it’s worth pulling in the rest of the conversation for you. If the original author and most participants fall outside your neighborhood, Bluesky often chooses not to expand that context automatically.

    Bluesky really is trying to avoid having to moderate, so this is their solution. Instead of banning or issuing takedown labels to DIDs, the system lets replies exist everywhere, but not in that particular instance of the thread.

    I find this ironic because a large reason why many people are staying on Bluesky and not moving to the fediverse—thank God, because I do not want them there—is discoverability, virality, and engagement.

    In case anyone is asking how I know so much about how these algorithms work: I was a consultant on a lot of these types of algorithms, so I certainly hope I’d know how they work, lol. No, you get no more details about the work I’ve done. I have no hand in the algorithm Bluesky is using, but I have proposed and implemented that type of algorithm before.

    I have an interest in noetics and the noosphere. A large amount of my ontological work is an extension of my attempts to model domains that have no spatial or temporal coordinates. The question is how you generalize a metric space that has no physical, spatial properties. I went to school to try to formalize those ideas. Turns out they’re rather useful for digital social networks, too. The ontological analog to spatial distance, when you have no space, is a graph of similarities.

    This can be modeled by representing each item as a node in a weighted graph, where edges are weighted by dissimilarity rather than similarity. Highly similar items are connected by low-weight edges, while less similar items are connected by higher-weight edges. Distances in the graph, computed using standard shortest-path algorithms, then correspond to degrees of similarity. Closely related items are separated by short path lengths, while increasingly dissimilar items require longer paths through the graph. It turns out that attempts to generalize metric spaces for noetic domains—to model noetic/psychic spaces—are actually pretty useful for social media algorithms, lol.
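
The weighted-graph model above can be sketched in a few lines. This is an illustration under stated assumptions, not the author's implementation: the similarity scores are made up, the choice of `1 - similarity` as the edge weight is one possible dissimilarity measure, and Dijkstra's algorithm stands in for "standard shortest-path algorithms".

```python
import heapq


def dissimilarity_graph(similarities):
    """Build an undirected graph whose edge weights are 1 - similarity.

    Assumes similarity scores lie in [0, 1], so weights are non-negative.
    """
    graph = {}
    for (a, b), sim in similarities.items():
        weight = 1.0 - sim
        graph.setdefault(a, {})[b] = weight
        graph.setdefault(b, {})[a] = weight
    return graph


def shortest_distances(graph, source):
    """Dijkstra's algorithm: shortest path length from source to each node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in graph[node].items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


# Made-up pairwise similarities between four items.
similarities = {
    ("alice", "bob"): 0.9,
    ("bob", "carol"): 0.8,
    ("alice", "carol"): 0.2,
    ("carol", "dave"): 0.5,
}
graph = dissimilarity_graph(similarities)
distances = shortest_distances(graph, "alice")
for name in sorted(distances, key=distances.get):
    print(name, round(distances[name], 2))
```

Note that alice and carol are only weakly similar directly, yet the path through bob is shorter than the direct edge. That is the behavior the post describes: closeness comes from the whole graph, not just from pairwise scores.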

  26. Prison Policy Initiative: Resource spotlight: Data projects tracking police misconduct, use of force, and employment histories. “The need for law enforcement transparency, oversight, and accountability has never been clearer. We highlight data projects that have helped document and investigate misconduct, as both data sources and as models for others who want to contribute to these collective […]

    https://rbfirehose.com/2026/01/29/resource-spotlight-data-projects-tracking-police-misconduct-use-of-force-and-employment-histories-prison-policy-initiative/
  27. Natural Hazards: Generating landslide archive inventories for Türkiye using web scraping and natural language processing techniques. “…we developed an automated approach that integrates web scraping, natural language processing (NLP), and geocoding techniques using digital media news sources in Türkiye to create a landslide archive inventory. Our algorithm verified 1727 of the 3051 news […]

    https://rbfirehose.com/2025/12/29/natural-hazards-generating-landslide-archive-inventories-for-turkiye-using-web-scraping-and-natural-language-processing-techniques/
  28. Syd Wiencek: Predicting the Winner of Rupaul Drag Race Season 18 Based on Collected Data. “We began by collecting the ages, hometowns, races, regions, and track records of the 17 previous winners. When looking at each winner’s track record, we determined the most popular challenges to win across the seasons in order to find which combination of challenge wins would be the most beneficial in […]

    https://rbfirehose.com/2025/12/16/syd-wiencek-predicting-the-winner-of-rupaul-drag-race-season-18-based-on-collected-data/

  29. Maps Mania: Introducing the Global Building Atlas. “The Global Building Atlas is a new global, high-resolution 3D dataset of the world’s 2.75 billion buildings. Developed by a research team at the Technical University of Munich (TUM) the Global Building Atlas provides broad coverage of areas historically missing from digital maps, including much of Africa, South America, and rural areas […]

    https://rbfirehose.com/2025/12/05/maps-mania-introducing-the-global-building-atlas/

  30. Some suggestions of open source tools for data #analytics, for people deciding which tools to use or consider.

    #Plausible for #web analytics. It's very lightweight and #privacy-friendly, #GDPR-compliant. It's possible to self-host, but their #SaaS offering is affordable and meets needs.

    #Metabase (self-hosted) for #business intelligence and organizing business/customer #data. It takes some time to configure and prepare #datasets, but it is worth it in the long term.

    #Clickhouse for sub-second #OLAP analytics.

    Depending on the project or business scenario, consider Apache Software Foundation tools like #Doris, #Airflow, #Druid, #Flink, #Cassandra. They take some time to learn, but it's a good idea to be familiar with them.
    #dataanalytics #business #opensource #tech

  31. Reuters: USDA Migrates Data Archive to New Website, Dropping Cornell’s Mann Library. “Archived crop and livestock reports from the U.S. Department of Agriculture were set to transfer to a new government website on Wednesday with the agency’s existing online archive, hosted by Cornell University’s Mann Library, decommissioned, the USDA and a Cornell official said on Tuesday.”

    https://rbfirehose.com/2025/10/05/reuters-usda-migrates-data-archive-to-new-website-dropping-cornells-mann-library/

  32. "On Friday, numerous essential #datasets were #purged from federal agency websites, including #data from #CDC PLACES (Population Level Analysis and Community Estimates), the Social Vulnerability Index (SVI), and the Climate and Economic Justice Screening Tool (CEJST)—to name just a few. While we don’t know when or if this data will return, we want to assure you that they are still accessible on our platform." policymap.com/blog/purged-fede #PolicyMap #PublicHealth #USPol #Project2025 #CivilRights

  33. I scaled up the popular Palmer Penguins machine learning dataset from 344 rows to 100k rows using an adversarial random forest, with an accuracy of 88%.

    Now, you have more rows of data with which to train your classification models.

    You can download it here, along with R & Python scripts, to load and view the dataset: ieee-dataport.org/documents/pa

    Have a dataset you want to scale up? Say hello!

    #machinelearning #randomforest #rstats #python #datascience #datasets #syntheticdatageneration #ai

  34. The University of Groningen announced the renewal of its cooperation agreement with #CBS (Statistics Netherlands) 🎉🎉

    Using CBS #microdata with existing #datasets, #researchers can expedite their investigations, avoid duplicative efforts, and allocate resources more efficiently.

    The UG Digital Competence Centre supports researchers in navigating the CBS platform and maximizing its potential.

    📰 Read more: t.co/O3I9G6lcrZ

    ℹ️ Learn more about CBS microdata: t.co/WfAhbiANe3

  35. Have the open source and open data communities, including organizations like the @eff, @creativecommons, or the @fsf, given any thought yet to updating various #FOSS and other licenses to address the current #SaaS problem of code or data that isn't necessarily being "redistributed," allowing these companies to dodge the obligation to contribute changes back upstream? How about the privatization and unauthorized commercialization of material licensed under the #GPLv3 and #FDL, #CreativeCommons licenses, and other open-license content that is often scooped up regardless of licensing into #AI #datasets that are then put behind #paywalls?

    To me, this seems very similar to the #Tivoization problem that led to the evolution of #GPLv2 to #GPLv3. It seems wrong that #OpenAI or #GitHub_Copilot can profit by putting licensed code, writing, or other data into a walled garden where even the original contributors that they rely on are charged for access.

    I'm not anti-business. If these companies were at least making the data sets freely available, there's nothing intrinsically wrong with making value-added profit off of properly-licensed data, although examples like CC-BY-NC 4.0 are a notable exception that should also be considered. Companies like Canonical, Red Hat, IBM, and others have been making money legally off of open source software for decades.

    Just because the label "AI" is slapped on something doesn't mean that companies should be allowed to ignore copyrights or licensing terms. If they want to do that, and licensing or requiring free access to open-content data can't prevent this land-grab, perhaps it's time we collectively revisit the whole framework around #intellectualproperty that currently allows corporations like #Disney and uncountable #PatentTrolls to create ever-expanding assertions of property rights that prevent almost any material from entering the public domain within a single human lifetime.