We are proud to announce that Pangeanic is expanding! With the takeover of Business Interactive Japan, we are launching a series of projects focused on new market needs. At Pangeanic, we aim to offer unique alternatives and global solutions for natural language processing.
Our Artificial Intelligence research department is at the heart of this expansion. This team of professionals is currently working on several exciting lines of research in Natural Language Processing (NLP), including projects on neural machine translation (with more than 250 language pairs), information retrieval, named entity recognition, text classification, summarization, and text generation.
We're also proud to announce that Marina Souto, a Machine Learning Engineer, has joined us at this pivotal time. She will apply her knowledge of advanced AI to the development of new AI-based tools, which will help Pangeanic and the team continue to advance in natural language processing.
We talked to Marina Souto to learn more about her work and what she will be doing at Pangeanic.
Since learning about NLP during my master's degree, I have built small projects on text generation with LSTM networks, topic modeling, and text classification. During my time at Pangeanic, I hope to deepen my knowledge of transformer architectures and machine translation.
I think data science is needed wherever data is generated. Artificial intelligence relies on a lot of data, and data science is needed to store, manage, and clean all that information. In my view, people are good at finding creative solutions and doing unique work; they should focus on that and let AI handle repetitive and mundane tasks.
Machine translation, because it has a broader impact. Although the Internet has made knowledge available to everyone, machine translation allows people to understand it without needing to speak English.
I guess so. Nowadays, some apps on my phone suggest the words and expressions I usually use, and ads recommend products I have searched for. Our online profiles are increasingly complex and, even if they don't replace us, they will know us very well.
In machine learning, bias refers to systematic error. Typically, this type of error stems from the distribution of the data used to train the model. For example, if you try to predict an ideal birthday gift using data on children alone, all predictions will be biased towards children's toys. Another, perhaps more worrying, example: if you try to predict what type of person would make an ideal boss for a company based on previous bosses, your model will probably favor male candidates if most of those bosses were men.
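The birthday-gift example can be sketched in a few lines. This is a deliberately trivial, hypothetical "model" (not any real system) that simply predicts the most common gift it saw during training; because the training sample contains only children, every prediction, even for an adult, comes out as a toy.

```python
from collections import Counter

def train_gift_model(training_examples):
    """Fit a trivial 'model' that always predicts the most common gift seen.

    training_examples: list of (age_group, gift) pairs -- invented data.
    """
    counts = Counter(gift for _, gift in training_examples)
    most_common_gift, _ = counts.most_common(1)[0]
    # The model ignores its input entirely: it can only echo the training data.
    return lambda age_group: most_common_gift

# Biased sample: every example comes from children.
biased_data = [("child", "toy"), ("child", "toy"), ("child", "board game")]
model = train_gift_model(biased_data)

# The model has never seen adults, so its prediction is skewed.
print(model("adult"))  # prints: toy
```

A real model is far more sophisticated, but the failure mode is the same: it can only generalize from the distribution it was trained on.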
A biased output reflects the data you have used: it will give good results for that sample but poor results for other groups. Society has also changed its views on some issues, so it is important to remember that old data may not reflect current beliefs.
For machine translation as a whole, there is a gender bias problem. In different languages, gender affects different parts of a sentence, and for some translations it is challenging to preserve the original gender or to keep the translation gender-neutral.
When training a model, you give it a problem and an answer. For example, take the problem of hiring the best candidate: if all the answers share attributes that point to the best candidate being male or white, it is easy for the model to associate those characteristics with the best answer. This exact problem happened with Amazon's automated resume screening in 2015, which was found to discriminate against women.
First of all, we need to obtain data that reflects the present situation and belief system; if that is not possible, the mismatch should at least be acknowledged as something to be aware of. Secondly, test the model on the population it is intended for and look at the potential problems that arise. Lastly, it is sometimes best to let go of metrics (such as accuracy or sensitivity) and focus instead on the real-world consequences and the ethical dilemmas involved.
Because the predictions of the model have real-life implications that can further entrench prejudices in society. Data can also give a narrow perspective of reality and mistake correlation for causation. For example, the larger the number of firefighters at a fire, the greater the damage. Although that is true, the firefighters do not cause the damage; rather, the larger the fire, the more firefighters attend. A shallow understanding of the data and the issue can end up reinforcing obsolete ideas.
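The firefighter example can be simulated directly: fire size (the hidden confounder) drives both the number of firefighters dispatched and the damage done, producing a strong correlation between two variables that have no causal link. The coefficients below are invented purely for illustration.

```python
import random

random.seed(0)

# Simulate fires: fire size (the hidden cause) drives BOTH the number of
# firefighters dispatched and the damage done.
fires = []
for _ in range(200):
    size = random.uniform(1, 10)                    # severity of the fire
    firefighters = 2 * size + random.gauss(0, 1)    # dispatched per severity
    damage = 5 * size + random.gauss(0, 2)          # damage grows with severity
    fires.append((firefighters, damage))

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

xs, ys = zip(*fires)
# Strong positive correlation, even though firefighters do not cause damage.
print(round(pearson(xs, ys), 2))
```

A naive model trained on this data would happily "learn" that sending fewer firefighters reduces damage, which is exactly the kind of shallow conclusion the answer above warns about.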