We have all had the experience of looking at a person's name and guessing a lot about that person, such as gender and ethnicity. In our earlier work, we found that name embeddings can be used as features for gender, ethnicity, and nationality classification.
In this work, we argue that gender information is encoded in the spelling of a name.
For example, if someone's name ends with "a", that person is very likely female (left side of the following figure). However, if the second-to-last character is "u", that person is likely male (think of "Joshua"; see the blue square in the bottom row of the right side of the figure).
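The positional pattern described above can be checked by tabulating, for each character at a given offset from the end of a name, the share of names labelled female. The following is a minimal sketch on a hypothetical toy sample. The names, labels, and helper functions are illustrative assumptions, not the data or code from the paper.

```python
from collections import Counter, defaultdict

# Hypothetical toy sample, for illustration only (not data from the paper).
TOY_NAMES = [
    ("maria", "F"), ("anna", "F"), ("emma", "F"), ("olivia", "F"),
    ("joshua", "M"), ("john", "M"), ("david", "M"), ("peter", "M"),
]


def char_at_from_end(name, offset):
    """Return the character `offset` positions from the end (1 = last),
    or an empty string if the name is too short."""
    return name[-offset] if len(name) >= offset else ""


def female_share_by_char(names, offset):
    """Map each character seen at the given end offset to the share of
    names carrying the label "F"."""
    counts = defaultdict(Counter)
    for name, label in names:
        counts[char_at_from_end(name.lower(), offset)][label] += 1
    return {ch: c["F"] / (c["F"] + c["M"]) for ch, c in counts.items()}


last = female_share_by_char(TOY_NAMES, 1)         # last character
second_last = female_share_by_char(TOY_NAMES, 2)  # second-to-last character

for ch in sorted(last):
    print(f"last char {ch!r}: female share {last[ch]:.2f}")
```

In this toy sample, most names ending in "a" are labelled female, while the only name with "u" in the second-to-last position ("joshua") is labelled male. A real analysis would need a large labelled name list to produce meaningful shares.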
We proposed a number of character-based ML models that show strong performance. Models trained on large Yahoo data also extend well to data from the Social Security Administration (table below).
While gender is typically conveyed by the first name, there are cases where the same first name is used for one gender in one culture but for another gender in a different culture. For example, the Italian male name "Andrea" (derived from the Greek "Andreas") is considered a female name in many languages, such as English, German, Hungarian, Czech, and Spanish. We found that combining the first name with the last name can help to disambiguate in such situations.
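One simple way to give a character-based model both names is to join them into a single string before encoding it as a fixed-length sequence of character indices. The sketch below assumes a lowercase ASCII vocabulary, a space separator, and a hypothetical `encode_full_name` helper. It is an illustration of the general idea and not necessarily how the models in the paper combine the two names.

```python
import string

# Assumed vocabulary: lowercase ASCII letters plus a space separator.
# Index 0 is reserved for both padding and out-of-vocabulary characters.
VOCAB = {ch: i + 1 for i, ch in enumerate(string.ascii_lowercase + " ")}


def encode_full_name(first, last, max_len=24):
    """Join first and last name with a space, then map each character to
    an integer index, truncating or zero-padding to `max_len`."""
    text = f"{first} {last}".lower()
    ids = [VOCAB.get(ch, 0) for ch in text[:max_len]]
    return ids + [0] * (max_len - len(ids))


# "Rossi" is a hypothetical surname used only for illustration.
encoded = encode_full_name("Andrea", "Rossi")
print(encoded)
```

With the surname included in the input, a model has a chance to learn that the same first name leans differently depending on the cultural context that the last name suggests.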
Interestingly, since the character-based models can be applied to any string, we can find the most masculine and feminine names in S&P 500 companies (Table below).
Reference: Yifan Hu, Changwei Hu, Thanh Tran, Tejaswi Kasturi, Elizabeth Joseph, Matt Gillingham, What's in a Name? -- Gender Classification of Names with Character Based Machine Learning Models, to appear in the journal of Data Mining and Knowledge Discovery, 2021.