Alex Helle is Chief of Operations for the open-sourcing of Pangeanic’s ActivaTM into the National and European Central Translation Memory (NEC TM). With the Beta version now available, we share an interview with Alex to find out more about the NEC TM project and how Member States can benefit from it.
What exactly are the aims of NEC TM?
We want to provide the tool with which European Administrations will organise their translation procurement and, in parallel, create national linguistic assets and bilingual data. By having a central repository where Public Administrations can run fuzzy matching and centralise their translation memories, they not only save money but also gain a digital infrastructure where all the bilingual text data created through translation procurement contracts is stored. This data can be shared at different levels, or not at all, and several Administrations can have different deployments. The point is that each Member State can increase this “national language treasure” with every translation contract, either on-the-fly or at the end of a translation contract. In short, NEC TM provides a centralised infrastructure for efficient data sharing, TM matching, TM retrieval, and domain categorisation of resources generated in Member States/EEA, with an emphasis on countries with low language resources. This will be made possible by the development of NEC TM, open-source software built on Pangeanic’s translation memory database ActivaTM.
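To illustrate what “fuzzy matching” against a translation memory means in general, here is a minimal sketch in Python. It is not NEC TM’s or ActivaTM’s actual algorithm, which the interview does not describe. The in-memory dictionary, the `fuzzy_lookup` function and the 0.75 threshold are assumptions made for illustration only, and the similarity score comes from the standard library’s `difflib`.

```python
from difflib import SequenceMatcher

# Hypothetical in-memory translation memory: source segment -> target segment.
# A real TM server would store these in a database, not a dict.
TRANSLATION_MEMORY = {
    "The contract enters into force on 1 January.": "El contrato entra en vigor el 1 de enero.",
    "The tender must be submitted in writing.": "La oferta debe presentarse por escrito.",
}


def fuzzy_lookup(segment, memory, threshold=0.75):
    """Return (score, source, target) for the best match at or above
    the threshold, or None if no stored segment is similar enough."""
    best = None
    for source, target in memory.items():
        # ratio() is a similarity score between 0.0 and 1.0.
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if score >= threshold and (best is None or score > best[0]):
            best = (score, source, target)
    return best


match = fuzzy_lookup("The contract enters into force on 1 March.", TRANSLATION_MEMORY)
if match is not None:
    score, source, target = match
    print(f"{score:.0%} match: {source!r} -> {target!r}")
```

A translator receiving such a match would reuse the stored translation and only edit the part that differs, which is where the savings from a shared memory come from.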
What are the benefits of NEC TM?
The benefits of NEC TM are as follows:
- Unified TM: NEC TM is CAT agnostic, so it can be used from any CAT tool used by the translation departments of the Member States/EEA or by external providers.
- Open Source: Pangeanic will release this commercial software under the GPL (the open-source General Public License) and customise it for free use by Public Administrations.
- Solid framework: NEC TM will also provide a centralised infrastructure for efficient data sharing, TM matching, TM retrieval, and domain categorisation of resources generated in the Member States/EEA.
- Lower translation costs: The NEC TM Data consortium’s objective is to organise unexploited national bilingual assets that can be used as open data and general data for machine learning, in order to lower translation costs at a national level and across Member States. It will gather translation memories from previous national contract awards from Member States and help them to centralise these language assets with the fast-performing NEC TM.
- Data bridge: NEC TM will allow Public Administrations to share data among themselves and with their translation contractors.
How was this project conceived?
The EC’s Programme was quite clear on the objectives: data gathering and language tools. We believed an initiative like NEC TM could fulfil the language tool option, as it empowers Public Administrations to gather data which is otherwise lost and remains in silos, on translation companies’ internal servers or PCs. European Public Administrations are losing valuable assets they pay for with public money because they simply lack the tool to organise the repositories (live or as TMX after the translation contract is over). In reality, most translation companies run translation servers in one way or another. The point here was to come up with a robust solution that could be implemented at a national level. However, NEC TM will not be implemented if we do not know the size of the expenditure in each country. We can’t provide the cure if we do not know there is a “problem”. I don’t like to call it a problem, because it is just the level of expenditure, but it’s difficult to organise something if we do not know the size of it. So, in parallel to the software development, half of our project is devoted to a country-by-country market study that will help public institutions and the EC itself to understand the size of public expenditure on translation. This report will be the basis for NAPs to speak to relevant authorities and push for national adoption. There are strong dissemination efforts in three European areas: Zagreb in September for Central Europe and the Balkans; Spain, Malta and Poland for national dissemination; and Latvia for the Northern region, co-hosted with ELRC. We will also co-host in France and Luxembourg to maximise influence and awareness about market size and the advantages of a national translation memory.
How has Pangeanic helped reach this milestone for NEC TM?
The NEC TM Data proposal includes the provision of a central TM-sharing repository, called the NEC TM Data platform. The platform will be based on Pangeanic’s commercial tool ActivaTM, which works on a similar concept and follows the same industry practices as other commercial tools and private organizations such as Memsource and TAUS. Pangeanic will release this commercial software under the GPL (the open-source General Public License) and customise it for free use by Public Administrations.
How different is this software from the one implemented by other projects?
NEC TM emphasizes fuzzy matching and leveraging for translation departments and their own translators. Within the scope of the project, plugins for different CAT tools will be provided so that Translation Project Managers or translators can use NEC TM directly. Moreover, access to the tool can be live, so translators feed the national repository as they work. We are offering a “live” tool, not a static repository. ELRI, for instance, will be a collection of bilingual assets, from which a TMX is created for translators to work with.
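For readers unfamiliar with TMX, the exchange format mentioned above, the following minimal Python sketch reads the translation units out of a small TMX document using only the standard library. The sample file content and the `read_translation_units` helper are assumptions for illustration and say nothing about how NEC TM itself imports or exports TMX.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical two-language sample following the TMX 1.4 layout:
# <tu> = translation unit, <tuv> = one language variant, <seg> = the text.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="example" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Public procurement</seg></tuv>
      <tuv xml:lang="es"><seg>Contratación pública</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_translation_units(tmx_text):
    """Return one dict per <tu>, mapping language code to segment text."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline markup.
                unit[tuv.get(XML_LANG)] = "".join(seg.itertext())
        units.append(unit)
    return units


for unit in read_translation_units(SAMPLE_TMX):
    print(unit)
```

A static TMX file like this is a snapshot, whereas a live translation memory is updated as translators confirm each segment.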
What are the future steps?
We are halfway through the project. These are exciting times, with a lot of work ahead. We want:
- To identify national administrations’ translation contractors from public sources (Official Gazettes) and create a pan-European report identifying the sector contractors, main contractors and main contracts in Member States.
- To set up a secure legal framework for Public Administrations (PPAA) and vendors to share data (IP clearance).
- To collaborate closely with ELRC so that Info Days for PPAA become part of the conference agenda and of translation organizations’ agendas, to disseminate information about the data creation, flow and gathering initiative as well as the legal framework.
- To create plugins for the different CAT tools used by the PPAA.
What type of license/hardware will be used to implement NEC TM?
NEC TM will be GPL (the open-source General Public License) and free to use for Public Administrations. This is a brief summary of the hardware and software requirements:
- RAM: 64GB recommended, 16GB minimum
- CPU: Not important
- Disk: SSD of 1TB recommended, 256GB minimum
- OS: Ubuntu 16.04 or later recommended
Is there anything else in particular that is needed in order to launch NEC TM?
NEC TM will run on Docker, a simple operating-system-level virtualization platform which works on Linux, Windows and macOS. Using Docker makes NEC TM easy to install and update.
In which countries will the beta version be launched?
Early adopters are Spain, Malta and Croatia, with Slovenia close behind. It is already in use in Latvia as part of the Hugo.lv project. The dissemination activities will help us introduce NEC TM in more Member States. This is an ongoing discussion with NAPs.
Carmen Herranz-Carr
Carmen is a Data Analyst currently working on the NEC TM project, in which she forms part of a team collecting and managing translation data information across the EU. Her interests lie in AI, social affairs, and language technologies.