• pinkapple@lemmy.ml
    11 hours ago

    via mechanisms including scraping, APIs, and bulk downloads.

    Omg exactly! Thanks. Yet there's nothing about having to use logins to stop bots, because that kinda isn’t a thing when you already provide data dumps and an API for Wikimedia Commons.

    While undergoing a migration of our systems, we noticed that only a fraction of the expensive traffic hitting our core datacenters was behaving how web browsers would usually do, interpreting javascript code. When we took a closer look, we found out that at least 65% of this resource-consuming traffic we get for the website is coming from bots, a disproportionate amount given the overall pageviews from bots are about 35% of the total.

    Source for the traffic being scrapers collecting data to train models: they’re blocking JavaScript, therefore bots, therefore crawlers, just trust me bro.