The Luddy School of Informatics, Computing, and Engineering’s Yong-Yeol Ahn, professor of Informatics and Computing, and Staša Milojević, professor of Informatics, are among the co-authors of a paper set to appear in PNAS.
The paper, “Unsupervised embedding of trajectories captures the latent structure of scientific migration,” uses a machine-learning technique to better understand and explain why people, including scientists, migrate. This can provide insight into the spread of people and ideas.
The researchers used a machine-learning technique called “word2vec,” which was originally developed as a language model, to study how people move around the world.
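As a rough illustration of the idea, a word2vec model can be trained on career trajectories exactly as it would be on sentences, with institutions standing in for words. The sketch below uses the gensim library; the trajectories, institution names, and parameter values are invented for illustration and are not the study’s actual data or settings.

```python
from gensim.models import Word2Vec

# Each "sentence" is one scientist's career trajectory: the institutions
# they were affiliated with, in chronological order (hypothetical examples).
trajectories = [
    ["Univ_Montreal", "Sorbonne", "Univ_Montreal"],
    ["MIT", "Stanford", "MIT"],
    ["Univ_Sao_Paulo", "Univ_Lisbon", "Univ_Porto"],
    # ... millions more trajectories in the real study
]

# Skip-gram word2vec: institutions that co-occur in trajectories end up
# close together in the learned embedding space.
model = Word2Vec(
    sentences=trajectories,
    vector_size=100,  # dimensionality of the embedding (illustrative choice)
    window=2,         # context window within a trajectory
    min_count=1,      # keep every institution in this toy example
    sg=1,             # use the skip-gram architecture
)

vector = model.wv["MIT"]  # 100-dimensional embedding of an institution
```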
Co-first authors are former Luddy Informatics Ph.D. student Dakota Murray, former Luddy visiting scholar Jisung Yoon, and former Luddy postdoctoral fellow Sadamori Kojaku. Murray is a research assistant professor at Northeastern University; Yoon is a postdoctoral fellow at the Kellogg School of Management at Northwestern University; and Kojaku is an assistant professor at Binghamton University.
Human migration and mobility drive major societal phenomena such as epidemics, economies, innovation, prestige and the diffusion of ideas. While mobility and migration are constrained by geographic distance, technological advances and globalization have made factors such as language and culture increasingly important.
Murray said measuring the geographic distance between two places is not necessarily a good way to show how far apart they really are; many other factors, such as language, have to be taken into account.
For example, driving 40 miles from one city to another might take an hour if the road is flat, or two hours if you have to go around a mountain and take dirt roads.
“Even though Quebec is geographically close to New England,” Murray said, “a scientist in Montreal is more likely to move for a new job in Vancouver or Paris than Boston.
“The method we use is able to learn these effective distances directly from data of how people move.”
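In embedding terms, one plausible reading of “effective distance” is distance in the learned vector space rather than in kilometers. Continuing the hypothetical sketch above (the specific institutions and any resulting numbers are illustrative, not findings from the paper):

```python
# Cosine similarity in the learned space acts as an "effective closeness":
# institutions that exchange many scientists score higher, regardless of
# how far apart they are on a map. (Toy model; real values would come
# from millions of trajectories.)
sim_paris  = model.wv.similarity("Univ_Montreal", "Sorbonne")
sim_boston = model.wv.similarity("Univ_Montreal", "MIT")

# The institutions "nearest" to Montreal in effective distance:
print(model.wv.most_similar("Univ_Montreal", topn=3))
```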
Applying word2vec to a database of three million migration trajectories of scientists allowed the researchers to better understand how culture, language and prestige influence migration. The study also provided a theoretical foundation and methodological framework for using embeddings to represent and understand migration within and outside of science.
“I am excited about the new possibilities this paper opens up for understanding the scientific enterprise and innovation,” Milojević said. “Unlike previous methods for studying the migration of researchers and scholars, which have focused on a single facet of migration at a time, our approach represents the multi-faceted nature of this phenomenon, simultaneously capturing factors such as geography, language, culture, history, economic opportunity, and even prestige.”
Murray said the method can be thought of as a “digital double” of mobility data, one that distills global scientific migration, a complex phenomenon, into a smaller and easier-to-use form.
He said Yoon and Kojaku made a theoretical connection between word2vec and the gravity law of human mobility.
“Migrations are more common when two places are bigger and closer together,” Murray said. “What we find is that word2vec is mathematically equivalent to a gravity model. Give it the migratory trajectories of scientists -- or any other migrations -- and it will learn a space where places are arranged so that they follow the ‘gravity’ law.
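In rough terms, and with notation chosen here for illustration rather than taken from the paper, the correspondence can be sketched as follows:

```latex
% Gravity law: the migration flux between places i and j grows with their
% sizes ("masses") and decays with the distance between them.
\[
  T_{ij} \propto \frac{m_i \, m_j}{f(d_{ij})}
\]
% Skip-gram word2vec: the probability of seeing place j in the context of
% place i depends on the dot product of their embedding vectors.
\[
  P(j \mid i) \propto \exp(v_i \cdot v_j)
\]
% Frequent flows pull v_i and v_j together, so distance in the learned
% space plays the role of d_{ij}: the embedding realizes a gravity law.
```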
“Empirically, we can use the representation we get at the end to study migration. For example, we used it to show that, despite their great distance, there is language-driven affinity between France and French-speaking Quebec, and between Portuguese-speaking Brazil and Portugal,” Murray said.
Added Ahn: “This particular example may be somewhat obvious, but the representation space encodes rich structure with some surprises. For instance, we can identify an ‘axis’ in this space that encodes the academic prestige of institutions.”
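One way such an axis could be extracted, shown here as a hypothetical sketch rather than the paper’s actual procedure, is the standard word2vec trick of vector arithmetic: average the difference between embeddings of institutions assumed to sit at opposite ends of the prestige spectrum, then project any institution onto that direction. The institution names below are placeholders, and the snippet assumes the trained `model` from the earlier sketch with these names in its vocabulary.

```python
import numpy as np

# Placeholder institution names; assumes they exist in the model's vocabulary.
high_prestige = ["HighRank_U1", "HighRank_U2"]
low_prestige  = ["LowRank_U1", "LowRank_U2"]

# The "prestige axis" is the normalized mean difference between the groups.
axis = (np.mean([model.wv[u] for u in high_prestige], axis=0)
        - np.mean([model.wv[u] for u in low_prestige], axis=0))
axis /= np.linalg.norm(axis)

def prestige_score(institution: str) -> float:
    """Project an institution's (normalized) embedding onto the prestige axis."""
    v = model.wv[institution]
    return float(np.dot(v, axis) / np.linalg.norm(v))
```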
Murray said a big problem with machine-learning tools, such as ChatGPT, is that no one really understands how they work or what’s happening “under the hood.”
“By studying word2vec, an early precursor to these more advanced machine-learning techniques, we discover that it is actually modeling something really simple, yet profound. In the process, we shed light on the inner workings of modern machine learning tools.”
Ahn said it was fascinating to find a hidden bridge between two completely different subjects – a language model and a law of mobility.
“I think it also highlights the importance of understanding the representation space that AI models produce,” he said. “By better understanding the representation space – not only the models’ output – we may be able to better understand and more concretely interpret these models.”
PNAS (Proceedings of the National Academy of Sciences) is the 110-year-old peer-reviewed journal of the National Academy of Sciences. It is an authoritative source of high-impact, original research spanning the biological, physical and social sciences, with a global scope that draws researchers from around the world.