Data processing in compliance with data protection regulations
Training good artificial intelligence (AI) models requires a lot of data, and getting access to that data is a constant problem. Many projects involve private data that cannot leave the organization holding it. Brilliant ideas with huge added value, such as automatic pathology detection in the healthcare sector, cannot be realized because the required data cannot be used for data protection reasons. The examples are many and varied. Data protection exists for good reasons: data should not be misused. Still, it would be great if there were a way to use such data, especially for projects that bring added value to society.
This is where Federated Learning (FL) comes in, and it may well become a new AI business model. It solves the data access problem in a smart way without violating data protection rules (important in the EU because of the GDPR). With FL, we build a more accurate AI without direct access to the data by moving the computation to the end devices: the training runs where the data lives, and only the resulting model updates from the different devices are merged into a more flexible and accurate global model. The end users' data is never transmitted. This protects users and their data while still enabling AI training. Moving computation to end-user devices also saves money – you no longer need a powerful server to train models!
What is Federated Learning?
Federated Learning is a collaborative machine learning method that works with decentralized data across multiple client devices. During the FL process, each client (the physical device on which the data is stored) trains a model on its own dataset and then sends that model to a server, where the client models are aggregated into one global model, which is then distributed back to the clients.
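The aggregation step described above is often implemented as weighted model averaging (the FedAvg idea): each client's parameters count in proportion to how much data that client has. A minimal sketch in plain NumPy, with function and variable names of our own invention:

```python
import numpy as np

def federated_averaging(client_weights, client_sizes):
    """One aggregation step: average each client's model parameters,
    weighted by that client's share of the total training data.
    Illustrative sketch only, not any framework's API."""
    total = sum(client_sizes)
    # Start from zeroed parameters with the same shapes as client 0's model.
    global_weights = [np.zeros_like(w) for w in client_weights[0]]
    for weights, size in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            global_weights[i] += (size / total) * w
    return global_weights

# Three toy "clients", each holding a single parameter vector.
clients = [[np.array([1.0, 2.0])],
           [np.array([3.0, 4.0])],
           [np.array([5.0, 6.0])]]
sizes = [10, 10, 20]  # the third client has twice as much data
print(federated_averaging(clients, sizes)[0])  # -> [3.5 4.5]
```

The server then sends `global_weights` back to every client, and the next training round starts from this common model.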
FL is not only a training process; it also defines the whole infrastructure needed to run that process on client devices and to aggregate the AI model updates for the best accuracy.
The main aim of FL is to never touch client data. Nowadays the data that the average Joe produces is very sensitive and, as we can see, very valuable. People mostly don't want to share data such as the words they type on a mobile keyboard or medical records tied to their person.
FL is especially important in the case of mobile devices. In classical model training, we have to send client data to a server (often really big datasets). In Federated Learning we only send a small set of model parameters.
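To get a feel for the difference, here is a rough back-of-envelope comparison; all the numbers are our own illustrative assumptions, not measurements:

```python
# Transferring a model update vs. transferring raw user data.
# Assumed sizes are illustrative only.
params = 1_000_000                    # a small, mobile-sized model
update_bytes = params * 4             # float32 weights -> ~4 MB per round

photos = 2_000                        # a user's photo library
dataset_bytes = photos * 3_000_000    # ~3 MB per photo -> ~6 GB total

print(update_bytes / 1e6, "MB per model update")
print(dataset_bytes / 1e9, "GB of raw data")
```

Even over many training rounds, shipping weight updates stays far cheaper than uploading the dataset once.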
Pros and cons - comparison with classical training
- Computations are moved to end-user devices
- Better model accuracy thanks to access to varied data
- More secure apps – user data is never transferred to a server
- Many models can be trained simultaneously at low cost
- FL is only worth using if end-user devices hold varied data that must not leave the device
- We must build simple yet very effective AI architectures (especially for mobile devices)
- Right now (August 2019) there is no FL platform for developers, so to use FL fully in a project you have to build your own or wait for one of the big companies like Google or Amazon to create one
- AI model verification can become hard because training happens "without" data
Another big benefit of FL worth mentioning is a model that stays up to date even for never-seen data. Suppose you bought an FL-based AI product that another company also uses. Your dataset will improve the global model (which you will receive as an update), and so will the other companies' datasets. We can compare it to humans sharing their knowledge with each other. Sharing the model without sharing the data is an advantage for everyone.
As long as we’ve tested the FL concept we can use it in any case of AI training. That’s very good information because FL gives us many advantages without limits.
That said, using FL in every case is not necessary. Suppose we are building a recommendation system for our online shop. Its task is to offer new products to a visitor based on his or her product view history. Here we have visitors (clients) and the online shop (server). So why not use Federated Learning? There is one simple answer: it would be overkill. Each visitor generates data that we can collect simply by serving the site: when a customer visits a product page, our server already executes scripts and sends that page (with the product information, i.e. our data for the AI) to the customer. The server can anonymize that data using a user ID and then learn the scheme by which customers choose products.
We should use Federated Learning when we have an app that can be installed on a user's device and that processes sensitive data. We should also consider whether classical learning methods are sufficient for our task, because implementing FL infrastructure costs more than using classically trained models. A good example is GBoard (Google's keyboard for Android devices). It uses the FL process to learn how certain groups of people use words when building sentences and to add new words to other clients' vocabularies.
Use case: NLP task
- Headline generation
- Conversation summaries – personal data
- Next-word prediction
- Spell checking
Use case: Image Processing
- Counting visitors
- Handwritten text recognition
- Object detection
- Anomaly detection
In each of these cases, multiple data sources are used to build a better AI that is distributed over every source.
Example: healthcare
Healthcare is one of the best use cases for Federated Learning. Here we have tons of personal data that cannot be shared by hospitals, private doctors, health insurance companies, research institutes, and so on. The data concerns symptoms, diseases and their progression, treatments, medication, consequences, genetics and many other influencing factors, as well as findings that would be of great value to medicine. However, these data are personal and subject to data protection regulations (GDPR). So how do we train a model without that data? This is where Federated Learning comes in. The training results are sent to a server, which also stores the results of other doctors researching the same topic. On this server, a global AI model with better accuracy is created from the training results and distributed back to the end devices. No personal data is exchanged, and at the same time the doctors always work with current and increasingly precise algorithms, trained by a community independent of location. Such AI can support humans during diagnosis, e.g. tumor detection or spotting bone damage in images.
Example: voice recognition
Nowadays we live in the era of digital assistants. More and more people use them in everyday life to keep things organized. Having an assistant-powered device at home is controversial because of privacy. To use an assistant, you say a wake word and then a command. Such devices shouldn't record our voice all the time while waiting for a command, so we can use Federated Learning to train a better wake-word recognition model and prevent the device from listening to us constantly, keeping our secrets secret.
As I mentioned before, a good FL use case is GBoard, and Google also gives us a framework for Federated Learning: TensorFlow Federated (TFF). For now it provides just a simulation environment for training models on decentralized data, without network protocols – a beta version. TFF has good documentation and, most importantly, it is open source. Many people are asking for other (better) model aggregation algorithms, so we just have to wait for updates.
There is also OpenMined. They have built part of an FL platform and tested their code on two Raspberry Pis, but…
“The whole process of installing the right Python version, PyTorch, PySyft, along with all the required dependencies took me a couple of days.” – FL tutorial on their blog
There is another project, FATE. They have done almost the same job as OpenMined and are waiting for contributors on GitHub.
We can say that any deep learning framework can handle a Federated Learning simulation. The point is to implement a good aggregation algorithm.
We used Fast.AI & PyTorch for an FL image segmentation task in several scenarios. As we can see, Federated Learning can give us better accuracy in the same number of epochs.
We divided the CAMVID dataset among 8 clients and prepared two scenarios. In the first, after every single client epoch we averaged the clients' model weights and sent the result back to the clients (48 epochs in total). In the second, we aggregated the model after every 3 epochs on each client and distributed it back to the clients 16 times (3×16 epochs in total).
Scenario 1: 1 epoch on each client and 48 aggregations (aggregation after every client epoch)
Scenario 2: 3 epochs on each client and 16 aggregations (aggregation after every 3 client epochs)
Averaging the model weights caused some learning instability but gave better accuracy.
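The two scenarios can be sketched with a toy simulation. This is plain NumPy on a stand-in 1-D regression problem; the setup and all numbers are our own illustration, not the CAMVID experiment itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# 8 toy "clients", each holding its own slice of a linear
# regression problem whose true slope is 2.0.
def make_client(n=32):
    x = rng.uniform(-1, 1, n)
    return x, 2.0 * x + rng.normal(0, 0.1, n)

clients = [make_client() for _ in range(8)]

def local_train(w, data, epochs, lr=0.1):
    """Full-batch gradient steps on one client's data (MSE loss)."""
    x, y = data
    for _ in range(epochs):
        grad = 2 * np.mean((w * x - y) * x)
        w -= lr * grad
    return w

def run(rounds, epochs_per_round):
    w_global = 0.0
    for _ in range(rounds):
        # Every client trains from the current global model,
        # then the server averages the resulting weights.
        w_global = np.mean([local_train(w_global, d, epochs_per_round)
                            for d in clients])
    return w_global

# Scenario 1: aggregate after every client epoch (48 rounds x 1 epoch).
# Scenario 2: aggregate every 3 client epochs (16 rounds x 3 epochs).
print(run(48, 1), run(16, 3))  # both approach the true slope 2.0
```

More local epochs per round means fewer communication rounds for the same total compute, at the price of letting the client models drift apart between aggregations.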
Review sentiment analysis
Here we used TensorFlow Federated & Keras with the IMDB dataset. The results were almost the same as in the image segmentation experiment, but we found that TFF's averaging (which averages model weight updates rather than the weights themselves) gave higher learning stability.
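The distinction can be sketched as follows. Averaging weight *updates* (deltas from the current global model) coincides with averaging the weights themselves when the server applies the averaged update at full strength, so the interesting case is a damped server step. The `server_lr` parameter below is our own illustrative knob, not TFF's API:

```python
import numpy as np

def average_weights(client_weights):
    """Plain weight averaging: the new global model is the mean
    of the client models."""
    return np.mean(client_weights, axis=0)

def average_updates(w_global, client_weights, server_lr=0.5):
    """Average the updates (deltas from the current global model)
    and apply them with a server step size. With server_lr=1.0 this
    reduces to plain weight averaging."""
    deltas = [w - w_global for w in client_weights]
    return w_global + server_lr * np.mean(deltas, axis=0)

w_global = np.array([0.0, 0.0])
client_ws = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]

print(average_weights(client_ws))                           # [2. 3.]
print(average_updates(w_global, client_ws))                 # [1. 1.5], a damped step
print(average_updates(w_global, client_ws, server_lr=1.0))  # [2. 3.]
```

A server step size below 1.0 moves the global model only part of the way toward the averaged client result, which is one plausible source of the extra stability we observed.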
I believe that Google will build a full FL platform and give us tools for easy integration with other programming languages. Federated Learning will become more and more popular, mostly because of data security, better accuracy and moving computation to client devices.