Do robots love language? Bias and Google Translate

Translate Tongan? You'll have to ask him--Google Translate can't help.

I tend not to follow the mainstream. I study languages that others don't, and I often gravitate towards marginal dialects. When I speak Arabic, I try to throw in a little Moroccan; when I speak Russian, I might add a bit of a Ukrainian accent. Right now I'm learning Swiss German, which I'm afraid will irritate my standard German-speaking friends.

Google Translate follows the mainstream. It is a tool developed by a savvy business to fill a commercial need: people who have money to spend want an application that lets them conduct their business more easily. I addressed the relative value of languages in an earlier post.

Unfortunately, Google Translate reflects the mainstream. It offers the languages of the powerful and translates using the language of the status quo, without regard for what is good or right independent of how things are done. For using language the way most powerful people do, Google Translate works well; those of us who seek out the margins and buck the trend of "standard" speech see its clear limitations, because the software reflects the language and gender biases of our world.

Which languages are most important?

I don’t know how I missed it, but I just saw this week that Google Translate expanded into African languages a few months ago.

80 languages at Google Translate

You can see that it now includes five more African languages: Somali (how did I miss that?!) and Zulu, plus the three most widely spoken languages of Nigeria (Igbo, Yoruba, and Hausa). The only African languages offered previously were Swahili and Afrikaans, both added in 2009.

The service follows the power structure of the Internet. You can see the stages of growth of the software in this article. Here is the general process of expansion. The first languages were all EU languages, quickly joined by East Asian ones. After Arabic and Russian appeared, eastern European and Southeast Asian languages came next. Other Southeast Asian and Central Asian languages followed, until the first American language (Haitian Creole) and the first African languages were incorporated (including Afrikaans, which some may call a European language). Even though Hindi was one of the earlier languages, other Indian languages surprisingly came only at a late stage (after Latin!), and finally, in the last stage, a group of African languages and the first Oceanic language, Maori, made it in. No indigenous languages of North or South America are represented yet.

I don't believe Google has a policy of including or excluding particular languages. As a successful business, they naturally gravitate to the languages that will bring the most sets of eyes to their site. Also, once they have figured out one language well enough to add it, adding a closely related language doesn't take much additional effort. For example, adding Danish, Swedish, and Norwegian in the same release makes sense, and once Spanish is well established, Catalan probably takes minimal effort.

Their stages of development reflect the reality of the internet and the commercial value of languages. Europe and East Asia are the most important, then the rest of Asia, and finally Africa. The indigenous peoples of Oceania and the Americas are treated as insignificant. I noticed some odd anomalies. The mother tongues of hundreds of millions of Indians were left until quite late; I think there's an assumption that Indians can just use English. At the same time, Welsh and Irish were added much earlier, despite having very few monolingual speakers. For some reason, Western European languages received a preference that Indian languages did not. I don't think this is racism, however; Google reflects an economic reality, with its amoral inequality of wealth and poverty.

Is Google Translate sexist?

At one time, some people accused Google Translate of gender bias. They noted that phrases with ambiguous gender sometimes came back with a specific gender assigned. Some people were scandalized because the translations reflected an unwanted stereotype. For example, this article describes gender bias manifested in German. In German, Lehrer can refer to a male or female teacher, while Lehrerin refers only to a female teacher. I translated "physics teacher" and "math teacher," and both came back with Lehrer, while "French teacher" and "cooking teacher" came back with Lehrerin, imposing a gender bias onto certain areas of specialization.

I ran another experiment. In Arabic, as in many other languages, there is no "it," so one uses a masculine or feminine pronoun based on the grammatical gender of the noun: "door" is "he," while "car" is "she," for example. I translated "He fixed the car" into Arabic and back, and got the same sentence: "He fixed the car." When I translated "She fixed the car" into Arabic and back, Google served up "It fixed the car." Maybe it is more easily imaginable that a wrench would fix a car than that a woman would.
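If you want to repeat the round-trip experiment yourself, a minimal sketch along these lines should do it, assuming you use Google's Cloud Translation API (the v2 Python client) with credentials already configured; the round_trip helper and the choice of Arabic as the pivot language are mine, not anything Google prescribes.

```python
# A minimal sketch of the round-trip experiment above, assuming the official
# Google Cloud Translation API (v2 Python client) and credentials configured
# via GOOGLE_APPLICATION_CREDENTIALS. The round_trip helper name is my own.
from google.cloud import translate_v2 as translate

client = translate.Client()

def round_trip(sentence, pivot="ar"):
    """Translate an English sentence into the pivot language and back."""
    forward = client.translate(sentence, source_language="en", target_language=pivot)
    back = client.translate(forward["translatedText"],
                            source_language=pivot, target_language="en")
    return forward["translatedText"], back["translatedText"]

for phrase in ["He fixed the car.", "She fixed the car."]:
    arabic, english = round_trip(phrase)
    print(f"{phrase!r} -> {arabic!r} -> {english!r}")
```

Because the service is retrained over time, the round trip may not come back today exactly the way it did for me.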

These results reflect the methodology of the translation, which is to draw on a large corpus of instances. The author of this article interviewed an engineer working on the software, who said, "Statistical patterns were used to allow the tool to determine what gender was being referred to. Should the text include the word 'dice', which is Spanish for 'says', the algorithm will not only assess the frequency that this is historically used to refer to a male or female speaker, but also the other words in the inputted text." The software reflects how a phrase is actually used. It is a robot mirroring the real use of human language, stereotypes, biases, and all.
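To make the engineer's description concrete, here is a toy illustration of the frequency idea, with an invented corpus and an invented helper; it is emphatically not Google's actual code, just the gist of picking whichever gender co-occurs most often with the surrounding words.

```python
from collections import Counter

# Invented toy corpus: phrases paired with the grammatical gender they
# appeared with in (hypothetical) human translations.
corpus = [
    ("physics teacher", "masculine"),
    ("physics teacher", "masculine"),
    ("math teacher", "masculine"),
    ("French teacher", "feminine"),
    ("French teacher", "feminine"),
    ("cooking teacher", "feminine"),
]

def most_likely_gender(phrase, default="masculine"):
    """Pick the gender most frequently seen with this phrase in the corpus."""
    counts = Counter(gender for p, gender in corpus if p == phrase)
    # With no evidence at all, fall back to a default -- the kind of
    # behavior the Hebrew examples below seem to show.
    return counts.most_common(1)[0][0] if counts else default

print(most_likely_gender("French teacher"))    # feminine
print(most_likely_gender("physics teacher"))   # masculine
print(most_likely_gender("nursing the baby"))  # masculine (the default)
```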

We can't really blame the bias of the software; we can only blame our own biases. The software has no ability to understand the pragmatics of the situation. Modern Hebrew marks the gender of the subject on present-tense verbs. When I translated "I am nursing the baby" or "I am giving birth," the gender came out masculine. It seems that when there is little evidence, the software defaults to masculine, even when that makes no sense in real life. When a real bias comes out of the language, the software presents it as what you, as a "typical" speaker of the language, were "probably" getting at. Simply put, people talk more about women as French teachers than as physics teachers in German. Google Translate reflects our world.

Our tool in our world

I love all languages. I think we can use language to lift people up. We don’t have to marginalize languages or individuals with what and how we speak.

But our world is what it is: biased. You can make more ad revenue with some languages than with others. We tend to find fewer women working in math and science than working with children. Google reflects this right back at us.

Languages rise and fall and adapt more quickly than our software does. Humans can see trends coming that computers can't. People sense when one way of speaking feels right and another doesn't.

I buck the trend, though. I want to speak languages that are not money-makers. I want to find ways to focus on the marginalized rather than keep them on the margins. If I want to change the status quo, I can’t rely on Google Translate. I have to learn to speak for myself, with my own words.

Be sure to “Like” if you support the margins, those people and languages who don’t follow the trend.

Photo credit: Light Knight / Foter / Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0)