We are proud to announce that Pangeanic is expanding! With the takeover of Business Interactive Japan, we are launching a series of projects focused on new market needs. At Pangeanic, we aim to put forward unique alternatives and global solutions for natural language processing.
Our Artificial Intelligence research department is at the heart of this expansion. This team of professionals is currently working on several exciting lines of research in Natural Language Processing (NLP), including projects on neural machine translation (with more than 250 language pairs), information retrieval, named entity recognition, classification, summarization, and text generation.
We're also proud to announce that Marina Souto, a Machine Learning Engineer, has joined us at this pivotal time. She will apply her knowledge of advanced AI to the development of new AI-based tools, which will help Pangeanic and the team continue to advance in natural language processing.
We talked to Marina Souto to learn more about her work and what she will be doing at Pangeanic.
1. Tell us about your experience with NLP. What would you like to work on at Pangeanic?
Since I learned about NLP during my master's degree, I have built small text generation projects using LSTM networks, topic modeling, and text classification. During my time at Pangeanic, I hope to increase my knowledge of transformer architectures and machine translation.
2. What are your thoughts on data science generally and AI as an aid or substitute for mundane tasks now performed by people?
I think data science is needed wherever there is data. Artificial intelligence relies on large amounts of data, and data science is needed to store, manage, and clean all that information. In my view, people are good at finding creative solutions and doing unique jobs; they should focus on that and let AI handle repetitive and mundane tasks.
3. Applied to language, AI has produced exciting new technologies. Which do you admire the most?
Machine translation, because it has a broader impact. Although the Internet has made knowledge available to everyone, machine translation allows everyone to understand it without the need to speak English.
4. It's tricky to predict what the future may bring, but do you think a world with virtual "alter egos" that learn from everything we write or say is possible?
I guess so. Nowadays, apps on my phone suggest the words and expressions I usually use, and ads recommend products I have searched for. Our online profiles are increasingly complex and, even if they don't replace us, they will know us very well.
About “Bias in machine learning”:
5. What is bias in machine learning and why is it a problem?
In machine learning, bias refers to systematic error. Typically, this type of error comes from the distribution of the data used to train the model. For example, if you try to predict an ideal birthday gift using data on children alone, all predictions will be biased towards children's toys. Another, perhaps more worrying, example: if you try to predict what type of person would be an ideal boss for a company based on previous bosses, your model will probably skew towards male candidates if most of the previous bosses were men.
6. What does a biased AI output look like?
A biased output will reflect the data you have used: it will give good results for that sample but will not generalize well to other groups. Society has also changed its views on some issues, so it is important to remember that old data may not reflect current beliefs.
7. Does bias occur in Pangeanic when setting up linguistic models for neural machine translation?
In machine translation as a whole, there is a gender bias problem. In different languages, gender affects different parts of a sentence, and for some translations it is challenging to preserve the original gender or to keep the translation gender-neutral.
8. Why does bias occur? If the root of the problem is in the data, how exactly does the ML model become biased?
When training a model, you give it a problem and an answer. Take the problem of hiring the best candidate: if all the answers share attributes that point to the best candidate being a man, or white, it is easy for the model to associate those characteristics with the best answer. Exactly this happened with Amazon's automated resume screening in 2015, which was found to discriminate against women.
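The dynamic described here can be reduced to a toy sketch (all records below are invented purely for illustration): a model that scores candidates by historical outcomes simply reproduces whatever imbalance those outcomes contain.

```python
# Hypothetical historical hiring records: (gender, hired?).
# Candidates are otherwise identical; only the historical
# labels are skewed towards one group.
history = ([("male", True)] * 8 + [("female", True)] * 2
           + [("male", False)] * 2 + [("female", False)] * 8)

def hire_rate(records, gender):
    """Fraction of candidates of the given gender who were hired."""
    outcomes = [hired for g, hired in records if g == gender]
    return sum(outcomes) / len(outcomes)

# A naive "score by historical hire rate" model inherits the bias:
print(hire_rate(history, "male"))    # 0.8
print(hire_rate(history, "female"))  # 0.2
```

Nothing about skill enters the score; the model has merely memorized the prejudice baked into the labels.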
9. You’re passionate about using data for good. What's your first-hand experience with bias in data labeling? Is there a solution for the bias issue in machine learning? What can companies do to ensure more fairness in their ML models?
First of all, we need data that reflects the present situation and belief system; if that is not possible, the mismatch should at least be acknowledged as something to be aware of. Secondly, test the model on the population it is intended for and look at the problems that arise. Lastly, it is sometimes best to let go of metrics (such as accuracy or sensitivity) and focus instead on the real-world consequences and the ethical dilemmas involved.
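The second suggestion, testing the model on the population it is intended for, can be sketched as a per-group evaluation (the predictions and labels below are invented for illustration): a reasonable overall accuracy can hide a much worse result for one group.

```python
# Hypothetical evaluation rows: (group, predicted, actual).
results = [
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 1), ("A", 0, 0),
    ("B", 1, 0), ("B", 0, 1), ("B", 1, 1), ("B", 0, 0),
]

def accuracy(rows):
    """Share of rows where the prediction matches the label."""
    return sum(pred == actual for _, pred, actual in rows) / len(rows)

overall = accuracy(results)
by_group = {g: accuracy([r for r in results if r[0] == g])
            for g in ("A", "B")}

print(overall)   # 0.75 -- looks acceptable in aggregate
print(by_group)  # {'A': 1.0, 'B': 0.5} -- group B fares much worse
```

Breaking a single headline metric down by group is the simplest way to surface exactly the disparity discussed above before the model reaches production.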
10. Why is it so important to be aware of bias in ML?
Because the model's predictions have real-life implications that can further entrench prejudices in society. Data can also give a narrow view of reality and mistake correlation for causation. For example, the more firefighters attend a fire, the greater the damage. Although that correlation is real, the firefighters do not cause the damage; rather, the larger the fire, the more firefighters attend. A shallow understanding of the data and the issue can end up reinforcing obsolete ideas.