home.social

#corpora — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #corpora, aggregated by home.social.

  1. Later today at #CHR2024, we are going to present our work on #Multilingual #Stylometry!

    We isolated the influence of #language on #authorship #attribution #accuracy by translating multiple #corpora into each other's languages while keeping #corpus composition stable.

    Interactive showcase: showcases.clsinfra.io/stylomet

    Full paper: ceur-ws.org/Vol-3834/paper9.pd

    This work was developed within the @CLSinfra project in #Trier, #Krakow and #Prague with Artjoms Šeļa, Evgeniia Fileva and Julia Dudar.

  2. So you wanna parse/manipulate some #PDF's, huh!?

    Well, you better #test your #software thoroughly or bad things will happen!🧪

    So how about "this corpus [which] contains nearly 8 million PDFs gathered from across the web in July/August of 2021":
    digitalcorpora.org/corpora/fil

    The entire corpus when uncompressed takes up nearly 8 TB!

    You can find some more links to different #corpora (even to ones deemed #unsafe!😬) at pdf-association's GitHub:

    github.com/pdf-association/pdf

    #Parsing #Testing

  3. 👋 Greetings! 👋

    We wanted to remind all #fakespeakers that the #Fakespeak project is still alive and kicking – especially after a long and #fakenews filled summer vacation.

    We have some great events and research output coming out in the next few months, including a #linguistics conference, #multilingual fake news #corpora, publications bringing together advanced linguistic features and #transformermodels, and a special issue in Linguistics Vanguard on the language of fake news.

    Follow along!

  4. Interestingly, very few psychologists are aware of #linguistic #corpora 📊 and their immense research potential. Platforms like CLARIN-PL offer invaluable data that can significantly enhance our understanding of human behaviour and social interactions. 🤝🗣️ It's time more of us psych folk tapped into these resources to advance our field! 🌟🔍

  5. And another one for fellow linguists interested in compiling #corpora of digital discourse: MastoScraper takes advantage of the Mastodon API to collect toots based on a keyword search.
    Here goes, feedback welcome!
    #linguistics @linguistics
    fmoncomble.github.io/mastoscra

  6. Next week, we'll be discussing how to archive and research social media data on a large scale "After Twitter". Very excited to see what comes out of this conference, and also the following data sprint delving into huge German Twitter corpora.
    dnb.de/twittertagung
    #AfterTwitter #corpora #research

  7. Interesting publication on medieval Latin text corpora by @TimGeelhaar: 🔖 Geelhaar, Tim. „Hamsterrad oder Himmelsleiter? Oder warum die Digitalisierung so endlos scheint“. 2024. doi.org/10.15499/KDS-005-016.

    #Latin #Neolatin #Corpora #OpenAccess

  8. #Eduhub days 2024 at #ZHAW and I cannot be there 😢 🩼

    If you go, stop by the marketplace: in the afternoon my colleague Maren Runte will show our work on creating a learning space for working with linguistic #corpora (to be released later this year).

    #EduhubDays24 #DigitalLinguistics

    eduhubdays2024.events.switch.c

  12. CLS INFRA Training School Vienna 2024, June 10th–12th, 2024:

    ExploreCor: "Using Programmable #Corpora in #Computational #Literary #Studies"

    This intensive program covers some of the most important steps in the research cycle of CLS, focusing on “Programmable Corpora” – dynamic collections of literary texts manipulated programmatically.

    Apply now! pretix.eu/CLSINFRA-trainingsch

    Colocated with #CCLS2024, June 13-14, 2024: jcls.io/site/conference/

    @CLSinfra #CLSINFRA

  17. @ZBW_MediaTalk The conference program is set (and will be published soon); applications for the subsequent data sprint are still open! #datasprint #twitter #dataScience #corpora
    dnb.de/twitterdatasprint

  18. Oliver Watteler and Ulrike Schneider are talking about "Can I publish my social media corpus" @ #cmc2023 #corpora #socialMedia #linguistics #gdpr

  19. One of the prettiest (if not very practical) university locations in Germany. 👋 from #cmc2023 at the University of Mannheim! #corpora #linguistics #socialMedia

  20. #corpora I bet we all have some anxiety about them. How do you choose what texts to look at? How do you know when you have enough? The #DataSittersClub is back, asking those questions and more to corpus linguist Shelley Staples, while exploring pizza and the Newbery Award for youth literature. #DigitalHumanities datasittersclub.github.io/site

  21. Is there any better news to wake up to than the fact that Norway has digitized All The Books and it's no problem at all to get all their Baby-Sitters Club translations? 🤩 #DigitalHumanities #DataSittersClub #corpora

  22. Some hard facts from the British National Corpus #BNC : of the 44 hits for "oddness", 16 come from the same source, which turns out to be D. A. Cruse's "Lexical Semantics" textbook from 1986. Which makes me wonder, not for the first time, whether it has been a good idea to include linguistic texts in the BNC sampling :)
    #EnglishLinguistics #corpora #metaLinguistics

  23. If you need to wrangle with EEBO-TCP for your text analysis project, consider using the EarlyPrint project corpus. They've done a bunch of preprocessing to transform "the early English print record, from 1473 to the early 1700s, into a linguistically annotated and deeply searchable text archive." Documentation and tutorials are all really thorough. earlyprint.org/about/ humanitiesdata.com/resources/4 -tcp

  24. Have you explored the new #EighteenthCenturyPoetryArchive corpus builder yet? It allows you to quickly create and share collections of poems, editions, or lists of authors with a single link!

    eighteenthcenturypoetry.org/re

    #c18th #poetry #18thC #c18dh #ECPA #corpora #readinglists

  25. How many copies of Matthias's vacation message do we all get before someone at ELRA figures out how to filter them?

    #NLP #corpora #email

  26. Similarly, you may want a so-called #MeaningRepresentation, some structured representation of the information conveyed by a text or a linguistically motivated semantic representation of the text. These annotations are essential for my work in #NaturalLanguageGeneration / #NLG, and a big struggle for the community is coming up with ways to build #corpora for different tasks, domains, genres, or languages which also have the kinds of MRs our systems use.
