#ngram — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #ngram, aggregated by home.social.
-
Spider v1.0.0 released.
Spider is not just another web crawler -- it is a purpose-built wordlist and ngram processor for hash cracking workflows.
URL Mode:
Point it at a URL and Spider crawls the target, extracts words, and generates frequency-sorted wordlists and/or ngrams.
But Spider does not stop at web crawling...
File Mode:
Feed it local files and it brings the same word-processing engine to your own datasets, scraped content, notes, dumps, configs, or any other plaintext source you want to turn into a targeted wordlist or ngram set.
More info:
https://forum.hashpwn.net/post/52
#spider #webcrawler #wordlist #generator #sort #ngram #cyclone #hashpwn #hashcracking
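For readers unfamiliar with the terms, here is a minimal Python sketch of what "frequency-sorted wordlists and/or ngrams" can mean. It is not Spider's code (Spider itself is a separate tool), and the tokenization rule, the lowercasing, and all function names are assumptions made for illustration only.

```python
import re
from collections import Counter

def extract_words(text):
    """Split text into lowercase word tokens (assumed rule: letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def word_ngrams(words, n):
    """Return every run of n consecutive words, joined by single spaces."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def frequency_sorted(items):
    """Return unique items by descending count; ties are broken alphabetically."""
    counts = Counter(items)
    return sorted(counts, key=lambda item: (-counts[item], item))

sample = "the quick fox and the quick dog and the cat"
words = extract_words(sample)
wordlist = frequency_sorted(words)               # most frequent words first
bigrams = frequency_sorted(word_ngrams(words, 2))  # most frequent 2-grams first
```

The same two steps (tokenize, then count and sort) apply whether the text comes from a crawled page or a local file.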
-
Decline and fall of NOTWITHSTANDING (preposition, conjunction, adverb) #language #English #style #composition #linguistics #edchat #discourse #connectives #ngram
-
The #ngram exercises are also great for learning #ergol, but only once you have thoroughly drilled the muscle memory elsewhere.
A year after I first became able to touch-type sentences in ergol, I still struggle with ring-finger and little-finger sequences, but only with the right hand.
It probably comes from a long-standing drawing habit in which those fingers mostly served as a rest to steady my line.
-
Recently I've combined various functions which I've been using in other projects (e.g. my personal PKM toolchain) and published them as a new library https://thi.ng/text-analysis for better re-use:
- customizable, composable & extensible tokenization (transducer based)
- ngram generation
- Porter-stemming & stopword removal
- vocabulary (bi-directional index) creation
- dense & sparse multi-hot vector encoding/decoding
- histograms (incl. sorted versions)
- tf-idf (term frequency & inverse document frequency), multiple strategies
- k-means clustering (with k-means++ initialization & customizable distance metrics)
- similarity/distance functions (dense & sparse versions)
- central terms extraction
The attached code example (also in the project readme) uses this package to create a clustering of all ~210 #ThingUmbrella packages, based on their assigned tags/keywords...
The library is not intended to be a full-blown NLP solution, but I keep on finding myself running into these functions/concepts quite often, and maybe you'll find them useful too...
#Text #Analysis #Cluster #KMeans #TFIDF #Ngram #Vector #TypeScript #JavaScript
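As a rough illustration of one listed concept, tf-idf, here is a small Python sketch. It does not use or mirror the thi.ng/text-analysis API; the function name, the sample documents, and the particular weighting (relative term frequency times the natural log of N over document frequency) are assumptions, and the post notes that the library offers multiple strategies.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            # Relative term frequency scaled by inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Hypothetical keyword lists standing in for package tags.
docs = [
    ["ngram", "text", "vector"],
    ["ngram", "cluster", "kmeans"],
    ["ngram", "text", "tfidf"],
]
weights = tf_idf(docs)
```

A term that appears in every document (here "ngram") gets weight zero, while terms unique to one document score highest, which is what makes the vectors useful as input to clustering.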
-
Fellow finicky writers: Do you prefer "advance notice" or "advanced notice"?
Both are attested. But FYI, #ngram says that "advance notice" is much more common, even if it's in decline.
https://books.google.com/ngrams/graph?content=advance+notice%2C+advanced+notice&year_start=1800&year_end=2022&corpus=en&smoothing=3
-
🚀 Spider v0.8.0
New features include:
"-file" to generate n-grams from local plaintext files
"-timeout" for URL crawling
"-sort" to output n-grams by frequency
https://forum.hashpwn.net/post/52
#spider #webcrawler #wordlist #ngram #infosec #hashcracking #golang #hashpwn
-
To listen there's no time to read: where shall we put the comma?
You'll find out when you look under the cut. 😉 As a teaser: this is about ЮMoney's tool for transcribing audio from internal calls into text, and about something else for our customers. 😎👇
https://habr.com/ru/companies/yoomoney/articles/896096/
#whisper #llmмодели #искусственный_интеллект #ai #саммаризация #диаризация #идентификация #транскрибация_звонков #ngram
-
#Google Books Is Indexing #AI-Generated Books
👉 #GoogleBooks is indexing low quality, AI-generated books that will turn up in search results, and could possibly impact Google #Ngram viewer, an important tool used by researchers to track #language use throughout history.
#GoogleNgram #NgramViewer #linguistics #diachrony #diachroniclinguistics #research #languages #aigeneratedcontent #AIgeneratedBooks
-
Google Books reportedly indexing bad AI-written works https://www.theverge.com/2024/4/5/24122077/google-books-ai-indexing-ngram
-
#HowToThing #030 — Procedural, rule-based & stochastic text generation using a custom DSL, parse grammar (via https://thi.ng/parse) and abstract syntax tree transformation (via https://thi.ng/defmulti).
Since it's #NaNoWriMo & #NaNoGenMo [1], I'm closing out this first season of 30 #HowToThing's with a related topic & maybe someone even finds it useful/interesting... 😉🤷♂️
This example is in principle inspired by @galaxykate's oldie & goodie #Tracery, but is using a super simple custom text format instead of JSON to define variables and template text. Variables are expanded recursively and I've also added features like dynamic, indirect pointer-like variable lookups to derive variables based on current values (useful for conditionals & context-specific expansions), hidden assignments, chainable modifiers... I've included 5 different "story" templates (incl. comments) showing various features. Just press "regenerate" to create new random variations...
Similar to the previous #HowToThing, I'm hoping this example also shows that approaching use cases like this via small domain-specific languages with proper grammar rules, does not require much ceremony and is often more amenable to change during prototyping (and later also more maintainable!) than just regex bashing approaches...
The parser grammar itself is explained in the https://thi.ng/parse readme. As usual, the grammar was created/prototyped with the Parser Playground[2], which we developed from scratch during the first thi.ng livestream[3] (2.5h video)...
Demo (example project #145):
https://demo.thi.ng/umbrella/procedural-text/
Source code:
https://github.com/thi-ng/umbrella/tree/develop/examples/procedural-text/src
If you have any questions about this topic or the packages used here, please reply in thread or use the discussion forum (or issue tracker):
https://github.com/thi-ng/umbrella/discussions
[1] https://github.com/NaNoGenMo/2023/
[2] https://demo.thi.ng/umbrella/parse-playground/
[3] https://www.youtube.com/watch?v=mXp92s_VP40
#ThingUmbrella #NaNoWriMo2023 #NaNoGenMo2023 #ProcGen #Generative #TextGeneration #Ngram #TypeScript #JavaScript #Tutorial
-
I am sure I have encountered the phrase before but it made me wonder: why not just say "it lasted five seconds" or "it lasted almost five seconds" -- are there certain instances where this sort of phrase is more likely than reference to a unit of time?
And fun that I immediately registered it as a contemporary act as well (one Mississippi, two Mississippi), and one of the first hits for me in Google Books involved keeping musical time.
For (wholly imprecise) fun, sharing a google #NGram chart screenshot, full version here: https://books.google.com/ngrams/graph?content=while+one+might+count%2Cwhile+one+could+count&year_start=1800&year_end=2019&corpus=en-2019&smoothing=3
-
#DMRG vs. #QMC. Problem is, I think QMC has many meanings!
#DensityMatrixRenormalizationGroup #QuantumMonteCarlo #NGram #statistics
-
Some interesting #physics #statistics here. These are Google Books #NGram plots for the following terms:
#QuantumEntanglement
#CorrelationFunction
#Correlation
#CorrelationCoefficient
#ConnectedCorrelationFunction
#ConnectedCorrelation
#DisconnectedCorrelation
#ClassicalCorrelations
#QuantumCorrelations
#LongRangeOrder
#ShortRangeOrder
-
Catching up on the #SPP2023 #preconference on #memory:
Felipe De Brigard introduced us to the topic and some recent trends before a series of talks ensued.
Find Felipe's work on gScholar: https://scholar.google.com/citations?user=l9gS2joAAAAJ&hl=en&oi=ao
-
One of the basic questions we tackle when working towards statistical language models is "Can we predict a word?"
This was also one of the intro questions to the students last Wednesday in our #ise2023 lecture no.4, when we were introducing simple n-gram language models.
#nlp #lecture #ngram #languagemodels #language #aiart #stablediffusion #creativeAI @fizise @KIT_Karlsruhe @nfdi4ds @nfdi4culture
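To make "Can we predict a word?" concrete, here is a minimal Python sketch of a bigram model with maximum-likelihood estimates. It is an illustrative assumption, not material from the lecture; the toy corpus and function names are made up, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def train_bigram_model(tokens):
    """Count how often each word follows each preceding word."""
    followers = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        followers[prev][nxt] += 1
    return followers

def next_word_probabilities(model, prev):
    """Maximum-likelihood estimate of P(next | prev) from the raw counts."""
    counts = model.get(prev)
    if not counts:
        return {}
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

def predict(model, prev):
    """Return the most probable next word, or None if prev was never seen with a follower."""
    probs = next_word_probabilities(model, prev)
    return max(probs, key=probs.get) if probs else None

corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram_model(corpus)
```

Here `predict(model, "the")` picks "cat" because it follows "the" in two of three cases. An unseen context yields no prediction at all, which is the gap that smoothing and longer n-gram back-off are meant to address.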
-
@unefamilleavelo "À vélo", certainly, but for the scooter there is room for doubt: it does, after all, "surround" the rider more. Admittedly both the TLF and the Petit Robert cite only "à scooter", but according to #Ngram that has never been the majority usage!
-
@fotis_jannidis makes clear in his talk at #DHd2023 that studies using the Google #NGram Viewer are not meaningful for the period since 1998, because the corpus composition is opaque and unrepresentative.
GRIN-Verlag, self-publishing, and retro-digitization act as three identifiable confounding factors in analyses.
-
Importing the #Google #ngram #Data set into #PostgresSQL.
I'm almost done with the bi-grams.
I've got about 900GB more to import, then it's on to the tri-grams.
This is an entire, unfiltered set that I'm going to back up first and put in cold storage.
Then I'm going to filter out rows that have characters that aren't allowed in #HashTags. This is the dataset that will power #FediMod's hashtag #accessibility service.
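The filtering step could look roughly like the Python sketch below. The post does not say which characters count as allowed in hashtags or how the filter is implemented, so the rule used here (every word consists only of Unicode letters, digits, or underscores) and all names are assumptions; in practice the same check would more likely run inside the database.

```python
import re

# Assumed rule (not from the post): a word is hashtag-safe when it consists
# only of Unicode letters, digits, or underscores.
HASHTAG_SAFE = re.compile(r"\w+")

def is_hashtag_safe(ngram):
    """True when every space-separated word of the n-gram is hashtag-safe."""
    words = ngram.split()
    return bool(words) and all(HASHTAG_SAFE.fullmatch(w) for w in words)

# Hypothetical bi-gram rows standing in for the imported data set.
rows = ["hand washing", "don't stop", "café au", "e-mail list", "state_of 2019"]
kept = [row for row in rows if is_hashtag_safe(row)]
```

Rows containing apostrophes or hyphens are dropped, while accented letters survive because Python's `\w` matches Unicode word characters for `str` patterns.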
-
The Google corpus of edited text shows a big pre-COVID spike in one word starting in 2012. But by 2019, hand washing and handwashing were equally likely.
-
What's been going on lately?
Google #Ngram Viewer: '[ich]', '[der]', 1800-2019 in German. https://books.google.com/ngrams/graph?content=ich%2Cder&year_start=1800&year_end=2019&corpus=31&smoothing=0&
-
BILD citizen pranks & the end of the fun society
Theft sometimes does pay after all. But otherwise it is pretty much forbidden. I no longer remember where I saw and photographed this while out and about. And I ask myself what misery it is that people today are forced to steal a BILD am Sonntag, unless … We already experienced the end of the fun society almost 20 years ago. I queried Google's Ngram Viewer for occurrences in books and periodicals. Together with […]
https://www.kritische-masse.de/logbuch/2026/03/bild-buergerstreiche-das-ende-der-spassgesellschaft/