[insert period joke]

Google Autocomplete across languages: most queried word combinations with “I’m having my period”.

This research was part of the DMI Summer School project I did in week two: Menstrual Issues Across Language Spaces

Collaborators:

Astrid Bigoni | Zuzana Karascakova | Emily Stacey | Sarah McMonagle

with a big thanks to:

Federica Bardelli (designs) | Giulia de Amicis (designs) | Han-Teng Liao (technical and linguistic advice)

Research Question

What questions do women ask Google regarding menstruation?

  • what are common queries across languages?
  • what topics are unique and only occur in a single language?
  • how are languages linked through queries about menstruation?

Dataset

Google’s database of Google Autocomplete suggestions per language (and country).

Where the predictions come from (source: support.google.com):

“As you type, autocomplete predicts and displays queries to choose from. The search queries that you see as part of autocomplete are a reflection of the search activity of all web users and the content of web pages indexed by Google.”

Tools

The DMI’s Google Autocomplete tool https://wiki.digitalmethods.net/Dmi/ToolGoogleAutocomplete

Read more for method, operationalization, findings and all the data visualisations

Infographic: Top Queries Across Languages, By Size

Infographic: Categories With Example Queries

Infographic: All Countries and Categories (excl UK)

Method & Operationalization

Query design:

We entered the query “I’m having my period” (and its respective translation in the selected languages) in the Google search engine.

There is a large range of terms that refer to periods, from clinical references (e.g. “menstrual cycle”) to slang (e.g. “being on the rag”, “aunt flo(w)”). Many languages have a mainstream, popular term such as “period” (en) or “ongesteldheid” (nl).

Operationalization

  • query equivalents of “I’m having my period” (as opposed to “I’m menstruating”, which is less colloquial) in a number of languages and local Google domains (e.g. Dutch | google.nl, German | google.de, etc.);
  • record the top ten Google Autocomplete suggestions in a spreadsheet and translate them to English;
  • manually post-code the findings into categories (e.g. period and… pregnancy, contraceptive pill, pain, etc.);
  • see which coded categories occur across languages;
  • find unique categories (i.e. those that occur in one language only);
  • visualize the unique categories in a matrix;
  • visualize the common categories across languages.
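
The last four steps can be sketched in a few lines of Python. This is only an illustration of the logic, not the workflow we actually used (we worked in a spreadsheet and coded by hand). The language labels and the category assignments below are placeholder values, not our data, and all function names are made up for this example.

```python
from collections import defaultdict

# Placeholder input: language label -> categories coded in its top ten.
# These values are illustrative only and are NOT the project's findings.
coded = {
    "lang_a": {"pregnancy", "activities", "frequency"},
    "lang_b": {"pregnancy", "what do I do"},
    "lang_c": {"what do I do", "telling a parent", "frequency"},
}


def languages_per_category(coded):
    """Invert the mapping: category -> set of languages where it occurs."""
    index = defaultdict(set)
    for language, categories in coded.items():
        for category in categories:
            index[category].add(language)
    return dict(index)


def unique_categories(index):
    """Categories that occur in exactly one language, with that language."""
    return {cat: next(iter(langs)) for cat, langs in index.items() if len(langs) == 1}


def shared_categories(index, minimum=2):
    """Categories that occur in at least `minimum` languages."""
    return {cat: sorted(langs) for cat, langs in index.items() if len(langs) >= minimum}


def presence_matrix(coded):
    """Rows are languages, columns are categories, cells are 1 (present) or 0."""
    categories = sorted({cat for cats in coded.values() for cat in cats})
    rows = {
        language: [int(cat in cats) for cat in categories]
        for language, cats in coded.items()
    }
    return categories, rows


if __name__ == "__main__":
    index = languages_per_category(coded)
    print("unique:", unique_categories(index))
    print("shared:", shared_categories(index))
    columns, rows = presence_matrix(coded)
    print(columns)
    for language, row in rows.items():
        print(language, row)
```

The presence matrix is the same shape as the matrix visualization: one row per language, one column per coded category. Raising `minimum` in `shared_categories` separates categories shared by a few languages from those shared by many.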

Findings

Distinctive topics emerge across the different language spaces. Some language spaces share topics in their top ten, while other issues occur in just one of the languages we analysed.

Unique queries that occur in the top 10 of one language only

“how do i tell my mum?” in Russian

premenstrual symptoms in traditional Chinese

“I’m having my period video” and “I’m having my period in my dream” in English.

Shared in some languages

Subtopics relating to “sex” occur only in the top tens of traditional Chinese (Hong Kong) and Spanish.

The subtopic “what do I do” is only in the top ten queries in Dutch, Czech and Russian.

We coded one category as “activities” (such as swimming and going to the beach) which appears in English, German and Italian but not in the other languages.

Shared in many languages

Queries about the duration (6 languages) and frequency (8 languages) of menstruation, as well as contraceptive pills (4 countries) and pregnancy (8 languages), appear in many language spaces.

The perks of a more detailed view

Grouping related subtopics makes some nuanced differences invisible. In German, for example, the query “I’m having my period in Turkish” stands out next to the other queries in the subtopic “Translations”, which often point to “I’m having my period in English”.

Limitations and methodological issues

  • The cut-off is automatically set to 10 since the autocomplete tool only extracts the top ten queries. It could be worth looking at more than just the top ten. We’ve found in other projects that the interesting stuff is not always near the top of the conversation.
  • We queried only 12 languages, which might give a very distorted view.
  • Some languages don’t give autocomplete suggestions at all and/or are not available in the DMI tool. We found this for Papiamento and Irish, for example.
  • In some languages, the equivalent for ‘period’ poses problems for Google. In Hungarian, for example, the equivalent would be “I just got it” or “it just arrived”. According to a native speaker, the term “menstruation” is hardly ever used. These phrases are so unspecific that Google returns very few and very vague suggestions; in this case just one: “im just here looking around”, with a history of 34700 queries.
  • For Russia and China, Google is not the dominant search engine. Also, China has no local Google domain, so our search query for traditional Chinese was carried out in Google Hong Kong.
  • In some cases, e.g. Slovak, the query returned only 6 results instead of a top ten.

Suggestions for discussion and further research

  • Why does a query sometimes return fewer than ten autocomplete results? (see the Slovak example)
  • The numbers are hard to compare because not all language spaces are created equal: a language like Dutch is largely confined to the Netherlands and Flanders, whereas English is used by many more people (especially online). Can we even compare these numbers? How?
  • The distinction between colloquialisms and slang is not always clear-cut. To what extent can you evaluate what is equivalent across languages if you’re not a native speaker of both?
  • The subtopic “what do I do” is in the top ten queries only in Dutch, Czech and Russian, and in none of the other languages. Does this indicate anything about awareness or sex education? When can we say that it indicates levels of awareness and/or education about the topic?
  • Querying more languages will give a more complete overview.
  • The subtopics that appear in many languages but not in all are interesting to research further. For example, queries relating to frequency emerge in eight out of 12 languages, but not in Dutch. Where is the subtopic of period frequency in the Dutch language space, if not in the top ten?
