for DeepLearning

Arm fracture detection in X-rays based on improved deep convolutional neural network


## **Goal**

The paper proposes a novel deep learning method for detecting arm fractures in X-rays.

## **Contribution**

1. New backbone network based on feature pyramid architecture

2. Image preprocessing procedure

3. Receptive field adjustment with anchor scale reduction and tiny RoI expansion

## **Method**

**Backbone network**

FPN + Fast R-CNN + RPN, with a <u>Gaussian non-local attention module (refines the integrated features)</u> for feature integration (novel method)

**Preprocessing**

  • Noise removal –> Morphological method
  • Brightening –> Cumulative distribution function

**Anchor scales reduction**

{P2; P3; P4; P5; P6} : {512; 256; 128; 64; 32} –> {256; 128; 64; 32; 16}

Guarantees more foreground RoIs for RPN training, because the GT bounding boxes are too small for the original scales
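The reduction can be sketched as a simple per-level mapping. The variable and function names are illustrative only, and the pairing of levels to scales simply follows the order in which the two lists appear in the note above.

```python
# Anchor scales per pyramid level, in the order listed in the note.
LEVELS = ["P2", "P3", "P4", "P5", "P6"]
ORIGINAL_SCALES = [512, 256, 128, 64, 32]


def reduce_anchor_scales(scales, factor=2):
    """Shrink every anchor scale by an integer factor.

    A factor of 2 reproduces the second list in the note.
    """
    return [scale // factor for scale in scales]


REDUCED_SCALES = reduce_anchor_scales(ORIGINAL_SCALES)
anchor_scale_by_level = dict(zip(LEVELS, REDUCED_SCALES))
print(anchor_scale_by_level)
```

Smaller anchors overlap more with small ground-truth boxes, which is why more of them end up labelled as foreground during RPN training.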

**Expanding receptive field to find tiny fractures**

Adding pixels to the width and height of small RoIs (length adjustment)

Helps extract useful information from tiny RoIs
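A minimal sketch of this length adjustment is shown below. The size threshold, the padding amount, and the image size are hypothetical values chosen for illustration, not the paper's settings.

```python
def expand_small_roi(box, min_size=32, pad=8, image_size=(512, 512)):
    """Pad an (x1, y1, x2, y2) box on every side if it is small.

    min_size, pad and image_size are hypothetical illustration values.
    Boxes whose width and height both reach min_size are returned unchanged.
    """
    x1, y1, x2, y2 = box
    width, height = x2 - x1, y2 - y1
    if width >= min_size and height >= min_size:
        return box
    image_width, image_height = image_size
    # Clip the padded box so it stays inside the image.
    return (
        max(0, x1 - pad),
        max(0, y1 - pad),
        min(image_width, x2 + pad),
        min(image_height, y2 + pad),
    )


print(expand_small_roi((100, 100, 110, 120)))
```

The enlarged box gives the detection head some surrounding context, which a tiny RoI on its own would not contain.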


Introduction to Human-In-The-Loop


0. Introduction to Human-in-the-Loop

Nowadays, machine learning and deep learning are becoming more and more mainstream, and a lot of companies and startups try to use these algorithms in their projects. However, building a dataset for a project is very expensive, and many engineers and researchers are trying to make this process more efficient. One such practice is called **“human-in-the-loop”** computing.

The human-in-the-loop process can be explained simply: it is a loop in which the machine relies on a human and adds human judgment to its model whenever it isn’t sure what the answer is. This simple pattern is at the heart of many well-known, real-world use cases of deep learning. It is also awesome because it addresses one of the biggest issues with machine and deep learning: it’s often very easy to build an algorithm that reaches 80 percent accuracy but often impossible to get it to 99 percent. Since 80 percent accuracy is not good enough for most real-world applications, the best setup lets the model handle the cases it is sure about and lets humans handle the remaining 20 percent. Let’s look at some examples.

Self-driving cars

Self-driving cars are a great example of implementing “human-in-the-loop” in deep learning. A lot of smart people have been working on self-driving cars for a very long time, and the state-of-the-art tech is actually pretty good. However, the biggest problem is that even if a model achieves 99% accuracy, a person might die in the 1% of cases where it fails.

Tesla’s automated driving mode follows exactly the human-in-the-loop pattern. The car mostly drives itself on highways but insists that the human keep their hands on the wheel in case of emergency. When the deep learning system senses doubt or uncertainty (maybe snow, strong illumination, or something the car has never seen before), it hands control back to the human driver. So while the car can drive itself almost all the time, it still needs human assistance.

Labeling images in Facebook

Facebook’s photo recognition algorithm has gotten crazy good! In fact, Facebook recognizes almost any photo you upload with high accuracy. But when its confidence is below a certain threshold, Facebook asks us (the uploader) to confirm the person or things labeled in the photo. When the confidence is even lower, it asks us to label the photo ourselves. All of this data is fed back into the algorithm so that it keeps getting better.
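The two-threshold pattern described here can be sketched as a small routing function. The threshold values and the action names are made up for illustration; nothing here reflects how Facebook actually implements it.

```python
def route_prediction(label, confidence, auto_threshold=0.9, confirm_threshold=0.6):
    """Decide how much human involvement a prediction needs.

    Both thresholds are hypothetical. Returns an (action, label) pair.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence >= auto_threshold:
        # Confident enough: accept the model's label without asking.
        return ("auto_label", label)
    if confidence >= confirm_threshold:
        # Somewhat unsure: show the guess and ask the human to confirm it.
        return ("ask_user_to_confirm", label)
    # Very unsure: ask the human to provide the label from scratch.
    return ("ask_user_to_label", None)


for confidence in (0.97, 0.75, 0.30):
    print(confidence, route_prediction("Alice", confidence))
```

Whatever the human answers in the last two branches becomes new labelled data, which is what closes the loop.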

Building datasets for deep learning models

Obtaining a large dataset is very expensive because of the large amount of annotation required. If the machine could just ask where annotation is needed, this would save a lot of work. Recently, a lot of researchers have been working on active learning, in which the machine itself learns where it needs more information and asks an oracle (a human annotator) to provide labels. With this method, we can use deep learning even with small datasets.

1. Summary

We have seen that human-computer interaction is much more important for artificial intelligence than we ever thought. Making sure computers and humans work well together is critical for all of these applications to work. Therefore, the human-in-the-loop process is very important now and will remain so in the future.

Artificial intelligence is here, and it’s changing every aspect of our lives every day. But we should know that it’s not replacing people’s jobs; it is making people in every job more efficient by acting as an assistant that handles the easy cases and watches and learns from the hard cases.


Introduction to Model Uncertainty


0. Introduction to Bayesian Learning

When analyzing data or making decisions, it is necessary to be able to tell whether a model is certain about its output, and to be able to ask “Maybe I need to use more diverse data? Or change the model? Or perhaps be careful when making a decision?”. Such questions are of fundamental concern in Bayesian machine learning. As mentioned in an earlier post, deep learning models generally only provide point estimates of parameters and predictions. Using such models forces us to give up the tools for answering the questions above, potentially leading to situations where we can’t tell whether a model is making sensible predictions or just guessing at random. Most deep learning models are viewed as deterministic functions and, as a result, are seen as operating in a very different setting from probabilistic models, which carry uncertainty information. I will not go over the basics of deep learning here, but you will find a lot of posts about them.

1. Model uncertainty

Deep learning can be applied to diverse applications such as skin cancer diagnosis, autonomous vehicles, and dog breed classification websites. For example, suppose a website has been trained on pictures of different dog breeds: when a user uploads a photo of their dog, the website returns a prediction of the breed with high confidence. But what happens if we input an image of a cat instead? The cat image is an example of out-of-distribution test data. The model has been trained on photos of dogs of different breeds and has learned to distinguish between them. Given a cat’s image, however, the model will not say anything about “cats” but will confidently output one of the dog breeds. This illustrative example extends to more serious settings, such as MRI scans of patients or the steering system of an autonomous car. These settings directly affect people’s lives and should be taken seriously: we are trusting the AI with the only life we have. Therefore, we want our model to produce some quantity that conveys a high level of uncertainty for such inputs. In other words, we want our model to say “I don’t know!!” when it receives a strange input or has low confidence.

Other situations that can lead to uncertainty include

  • Noisy data (for example because of measurement imprecision, leading to aleatoric uncertainty)

  • Uncertainty in the model parameters that best explain the observed data (for example, the weights and biases of a deep neural network, where we might be uncertain about which parameter values to choose)

  • Structure uncertainty (what model structure should we use?)

Uncertainty information is also very important for the practitioner. Understanding whether a model is under-confident or falsely over-confident (meaning its uncertainty estimates are too small) can help us get better performance out of it. Perhaps even more importantly, model uncertainty information can be used in systems that make decisions affecting human life, such as in the medical domain or in the autonomous control of drones or cars. All of this can be considered under the umbrella field of AI safety. The greater the consequences of a prediction, the more we should care about its uncertainty.

Let’s consider an example. In self-driving cars, low-level feature extraction such as image segmentation and localization is used to process raw sensory input. The output of such models is then fed into higher-level decision-making procedures. The higher-level decision making can be done through expert systems, for example by relying on a fixed set of rules (“if there is a pedestrian on your right, you should not steer right”). However, mistakes made by lower-level components can propagate up the decision-making process and lead to devastating results: one small failure in an algorithm can cause serious problems. In such modular systems, one could use the model’s confidence in the low-level components and make high-level decisions given this uncertainty information. For example, a segmentation system that can identify its uncertainty in distinguishing between the sky and another vehicle could alert the user to take control of the steering.

2. Applications of model uncertainty

Besides AI safety, there are many more applications that rely on model uncertainty. These include choosing what data to learn from and exploring an agent’s environment efficiently. Common to both tasks is the use of model uncertainty to learn from small amounts of data. This is often a necessity in settings where data collection is expensive or time consuming.

Many machine learning algorithms, including deep learning, often require large amounts of labelled data to generalize well. The amount of labelled data required increases with the complexity of the problem and of the input data. To automate MRI scan analysis, for example, an expert would have to annotate many MRI scans, labelling each of them part by part and indicating whether the patient has cancer or not. But expert time is very expensive, so obtaining a large amount of annotated data is a serious cost issue. So how can we learn in settings where labelled data is scarce and expert knowledge is expensive?

One possible approach is active learning. Here the model itself chooses which unlabelled data would be most informative and asks an external “oracle” (a human annotator, for example) for labels only for these new data points. The choice of data points to be labelled is made through an acquisition function, which ranks points by their potential informativeness. By following this learning framework, we can decrease the amount of required data by orders of magnitude while still maintaining good model performance. To make this work, we should aim to produce good uncertainty estimates for image data and rely on these to design an appropriate acquisition function. Later, we will try to cover and develop extensions of such tools that can be deployed in small-data regimes and provide good model confidence.
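As a rough sketch of what an acquisition function can look like, the snippet below ranks unlabelled points by the entropy of their predicted class probabilities. Predictive entropy is only one common choice and is my assumption here; the probability vectors in the pool are made-up values.

```python
import math


def predictive_entropy(probs):
    """Entropy of a class-probability vector (higher means less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_labelling(pool_probs, k):
    """Return the indices of the k pool points with the highest entropy."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: predictive_entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical model outputs for four unlabelled points.
pool = [
    [0.98, 0.01, 0.01],
    [0.40, 0.35, 0.25],
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],
]
print(select_for_labelling(pool, k=2))
```

The two points the model is least sure about are sent to the oracle, while the confidently classified ones are left alone.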

3. Model uncertainty in deep learning

It is important to note that most deep learning models do not offer model confidence. Regression models output a single vector regressing to the mean of the data, and classification models output a probability vector (softmax) at the end of the pipeline, which is often erroneously interpreted as model confidence.

The image above shows a sketch of softmax input and output for an idealized binary classification problem. The training data lies between the dashed grey lines, the function point estimate is shown as a solid line, and the function uncertainty is shown as a shaded area. The dashed red line marks a point x* far from the training data; if we ignore function uncertainty, x* is classified as class 1 with probability 1. This shows that a model can be uncertain in its predictions even with a high softmax output. Passing a point estimate of a function through a softmax results in extrapolations with unjustifiably high confidence for points far from the training data. Passing the whole distribution (the shaded area in the picture) through a softmax, however, better reflects classification uncertainty far from the training data. This part might not be clear yet, but let’s try to figure it out slowly.
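A small numerical sketch can make this concrete. It uses a sigmoid (the two-class case of softmax) and assumes, purely for illustration, that the function value at a point is Gaussian with a given mean and standard deviation; the numbers are invented.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def class1_probability(mean_logit, std_logit, num_samples=10000, seed=0):
    """Compare the point-estimate probability with the averaged one.

    Returns (sigmoid of the mean, mean of the sigmoid over sampled logits).
    The Gaussian over the logit is an illustrative assumption.
    """
    rng = random.Random(seed)
    point_estimate = sigmoid(mean_logit)
    samples = [sigmoid(rng.gauss(mean_logit, std_logit)) for _ in range(num_samples)]
    averaged = sum(samples) / num_samples
    return point_estimate, averaged


# Near the training data: little function uncertainty.
print(class1_probability(mean_logit=4.0, std_logit=0.1))
# Far from the training data: same mean, large function uncertainty.
print(class1_probability(mean_logit=4.0, std_logit=10.0))
```

In both cases the point estimate gives the same high probability, but averaging over the distribution pulls the far-away prediction noticeably closer to 0.5.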

Even though modern deep learning methods do not capture model confidence, they are closely related to a family of probabilistic models that induce probability distributions over functions: Gaussian processes. By placing a probability distribution over each weight (rather than a single constant value), a Gaussian process can be recovered in the limit of infinitely many weights. When we place distributions over the weights, we call the model a Bayesian neural network. These networks may look fancy, but some of them are quite difficult to work with, often requiring many more parameters to be optimized, and they haven’t really caught on within the deep learning community, perhaps because of their limited practicality.

So how can we make this practical? Such methods should apply well to modern architectures, including CNNs and RNNs. We will therefore concentrate on developing practical techniques for obtaining model confidence in deep learning, techniques that are also well rooted in the theoretical foundations of probability theory and Bayesian modelling. We will make use of stochastic regularization techniques (SRTs). These techniques adapt the model output stochastically as a form of regularization, so the loss becomes a random quantity, which is optimized using tools from the stochastic non-convex optimization literature. Popular SRTs include dropout, multiplicative Gaussian noise, DropConnect, and countless other recent techniques.

We can take almost any network trained with an SRT and, given some input x’, obtain a predictive mean E[y’] (the expected model output given our input) and a predictive variance Var[y’] (how confident the model is in its prediction). To do so, we simulate a network output for the given x’, applying the SRT as if we were using the model during training (obtaining a random output through a stochastic forward pass). We repeat this process several times (T repetitions), sampling outputs {y1’(x’), y2’(x’), …, yT’(x’)}. As will be explained below, these are empirical samples from an approximate predictive distribution. From these samples we can get an empirical estimator for the predictive mean of our approximate predictive distribution, as well as for the predictive variance (uncertainty):
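For a scalar output, the ordinary sample mean and sample variance over the T stochastic passes give one such estimator. This is a simplified sketch under my own assumptions: the output is scalar, and any observation-noise term that a full derivation would add to the variance is left out.

$$
\mathbb{E}[y'] \approx \frac{1}{T}\sum_{t=1}^{T} y_t'(x'),
\qquad
\mathrm{Var}[y'] \approx \frac{1}{T}\sum_{t=1}^{T} y_t'(x')^2 - \left(\frac{1}{T}\sum_{t=1}^{T} y_t'(x')\right)^2
$$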

More justification and explanation will be given in the next post. Equation (1.4) results in uncertainty estimates that are practical with large models and big data, and that can be applied to image-based models, sequence-based models, and many different settings such as reinforcement learning and active learning.
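The sampling procedure can be sketched in plain Python with dropout as the SRT. The tiny one-hidden-layer network and its hand-picked weights are hypothetical (nothing is trained here), and the variance is the plain sample variance of the T outputs.

```python
import random


def stochastic_forward(x, weights, rng, keep_prob):
    """One forward pass of a tiny ReLU network with dropout left switched on."""
    hidden = []
    for w_in, bias in weights["hidden"]:
        activation = max(0.0, w_in * x + bias)
        # Inverted dropout: drop the unit with probability 1 - keep_prob,
        # otherwise rescale it so the expected activation is unchanged.
        keep = rng.random() < keep_prob
        hidden.append(activation / keep_prob if keep else 0.0)
    return sum(w_out * h for w_out, h in zip(weights["output"], hidden))


def mc_predict(x, weights, num_passes=200, keep_prob=0.8, seed=0):
    """Predictive mean and variance from num_passes stochastic forward passes."""
    rng = random.Random(seed)
    samples = [stochastic_forward(x, weights, rng, keep_prob) for _ in range(num_passes)]
    mean = sum(samples) / num_passes
    variance = sum((s - mean) ** 2 for s in samples) / num_passes
    return mean, variance


# Hypothetical, hand-picked weights: (input weight, bias) per hidden unit.
weights = {
    "hidden": [(0.5, 0.1), (-0.3, 0.2), (0.8, -0.1), (0.2, 0.0)],
    "output": [1.0, -0.5, 0.7, 0.3],
}
mean, variance = mc_predict(2.0, weights)
print(mean, variance)
```

Setting `keep_prob=1.0` turns the dropout off, so every pass returns the same value and the variance collapses to zero, which is exactly the deterministic point estimate.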

For future work, we will try to apply this Bayesian concept to many of today’s models. The most notable topics are:

  • a theoretical analysis of Monte Carlo estimator variance used in variational inference
  • a survey of measures of uncertainty in classification tasks
  • an empirical analysis of different Bayesian neural network priors and posteriors with various approximating distributions
  • new quantitative results comparing dropout to existing techniques
  • tools for heteroscedastic predictive uncertainty in Bayesian neural networks
  • applications in active learning
  • a discussion of what determines what our model uncertainty looks like
  • an analytical analysis of the dropout approximating distribution in Bayesian linear regression
  • an analysis of ELBO-test log likelihood correlation
  • discrete prior models
  • an interpretation of dropout as a proxy posterior in spike and slab prior models
  • a procedure to optimize the dropout probabilities based on the variational interpretation to separate the different sources of uncertainty


Why Tensorflow?


1.1 Why Tensorflow?

Tensorflow is a library for machine learning, especially deep learning, that makes writing code much easier. The library was made by Google, and we can easily use it on Windows, Mac, Linux, Android, iOS, etc.

But we should note that Tensorflow is not the only library for machine/deep learning. There are also Torch, Caffe, MXNet, CNTK, etc.

So why use Tensorflow??

My answer is that Tensorflow has a large community! For an engineer like me, fixing problems and debugging are really important. Also, a lot of engineers use Tensorflow now, and a lot of the deep learning code on GitHub is written with Tensorflow. In Korea, there is a dedicated community on Facebook called TensorFlow KR where we can share a lot of information.

I’m not saying that other libraries are bad~ But if you want to start learning about libraries or deep learning, I recommend starting with Tensorflow!!! Most of the code on this blog will be written in Tensorflow too.

1.2 Installing Tensorflow

Installing Tensorflow : Click


Introduction to Bayesian Deep Learning


0. Introduction to Bayesian Deep Learning

When it comes to Deep Learning, we usually think of a Neural Network.

Today, when a large amount of training data is given, the network fully trusts the data and finds its parameters according to that data. Wouldn’t this cause any problems?

Nowadays, the trend is shifting toward uncertainty. Uncertainty is when the network tells you “I don’t know about this input…” When it comes to classification, knowing that you don’t know something is also very important.

So how can we measure the uncertainty, and what should we do to fully understand it?

In order to fully understand Bayesian Deep Learning, we need to go slowly, step by step. If you are not familiar with math, especially probability, you should try to study it first before reading this post.

Let’s look at the picture below,

This paper visualized the uncertainty for semantic segmentation. By using this ‘uncertainty’ term, it was possible to reduce the effect of noisy training data and obtain outstanding results.

My goal is to write posts covering Bayesian Deep Learning with reference to this paper.

1. Two Bayes Rules in Machine Learning

If you have some knowledge of machine learning, pattern recognition, or basic probability, you might be familiar with the following equation.
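The familiar form of Bayes’ rule for predicting y from x, which I assume is the equation meant here, reads:

$$
P(y \mid x) = \frac{P(x \mid y)\,P(y)}{P(x)}
$$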

So, the equation above gives the probability of y given x (left-hand side). You can use this probability for either classification or regression. For example, for classification we can pick the class with the highest probability as our guess for the input.

But from a Deep Learning point of view, something might feel awkward. There should be a network, so why are there no parameters? Of course: this is the equation for a Naive Bayes classifier, which has no network parameters. Its parameters are the mean and the variance.

Let’s look at the Bayes Rule below.

Here, we can see a new parameter W. We can also see that, in the posterior probability, Y has moved to the right side of the conditioning bar. What is the difference?

Let’s try to change the equation above a little bit.

Semantically, the Posterior is the probability of the parameters when the input and output are given, while the Likelihood is the probability of the output when the input and the parameters are given. The Prior is the probability of the parameters, and the Evidence is the probability of the output given the input.

Well~ the posterior probability is simply the probability distribution of the parameters when the training data is given.

But the Likelihood treats the input X and the output Y separately, which is confusing. However, if we think carefully, the input and the output are fixed by the data, so the only thing we can manipulate is the parameter. In that sense, the Likelihood measures how well a given parameter explains the observed situation.

To make Bayes Rule easier to understand, let’s try removing the input X, as shown below.

Instead, we will combine the input and the output into a dataset D, as shown below.
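Written out, the three forms discussed here would look as follows. The labels 2-a and 2-c follow the references in the text, while the label 2-b for the intermediate form is my assumption. The prior is written as P(W), which assumes the parameters do not depend on the input X.

$$
\begin{aligned}
P(W \mid X, Y) &= \frac{P(Y \mid X, W)\,P(W)}{P(Y \mid X)} && \text{(2-a)} \\
P(W \mid Y) &= \frac{P(Y \mid W)\,P(W)}{P(Y)} && \text{(2-b)} \\
P(W \mid D) &= \frac{P(D \mid W)\,P(W)}{P(D)} && \text{(2-c)}
\end{aligned}
$$

In each line the left side is the Posterior, and the right side is the Likelihood times the Prior, divided by the Evidence.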

The Likelihood in equation 2-a answers the question: which X and parameters best describe Y?

The Likelihood in equation 2-c answers the question: which parameters best describe the given data?

Well, these two are eventually the same concept.

If the thing you are interested in (W) sits on the left of the conditioning bar, it is the Posterior; if it sits on the right, it is the Likelihood. Easy peasy~

And we should note that the prior is a distribution over the thing we are interested in (the parameters!!).

Depending on our purpose, we sometimes want the Likelihood and sometimes the Posterior. We will find the W that maximizes the likelihood or the posterior.

2. Maximum Likelihood Estimation(MLE) vs Maximum A Posteriori(MAP)

Before we get down to business, let’s quickly cover the basic concepts of MLE and MAP.

As mentioned earlier, P(D|W) is the Likelihood, and we want to find the W that maximizes it. This is called **Maximum Likelihood Estimation (MLE)**, which we can think of as finding the W that best describes the training data.

So our goal is first to maximize P(D|W), or P(Y|X, W). How do we do this??

When we think of regression, P(Y|X, W) is modelled as a Gaussian distribution that accounts for noise. Its mean is the network output f(x), which is determined by the parameters W, and its variance comes from the random noise.

Then we can get the W by maximizing the likelihood given by the data.

Under this Gaussian assumption with a fixed variance, maximizing the Likelihood is the same as minimizing the Mean Squared Error.

For a classification problem, if we model the output as a categorical (Multinoulli) distribution, maximizing the Likelihood is the same as minimizing the Cross Entropy.
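A short derivation shows why the regression case reduces to the squared error. It assumes N independent data points and Gaussian noise with a fixed variance σ²; the symbols f_W, N and σ are introduced here for illustration.

$$
\begin{aligned}
\log P(Y \mid X, W) &= \sum_{i=1}^{N} \log \mathcal{N}\!\left(y_i \mid f_W(x_i), \sigma^2\right) \\
&= -\frac{1}{2\sigma^2} \sum_{i=1}^{N} \left(y_i - f_W(x_i)\right)^2 - \frac{N}{2}\log\left(2\pi\sigma^2\right)
\end{aligned}
$$

The second term does not depend on W, so

$$
\arg\max_W \log P(Y \mid X, W) = \arg\min_W \frac{1}{N}\sum_{i=1}^{N}\left(y_i - f_W(x_i)\right)^2 .
$$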

Now, let’s think about maximizing the posterior, which is called Maximum A Posteriori (MAP). Maximizing the posterior means maximizing the product of the Likelihood and the Prior.

This is just MLE with the Prior term added.

This means that while solving MLE, we additionally maximize the probability of W, assuming that W follows a known distribution. This plays a very similar role to a regularization term.

MLE and MAP are explained well here
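The link to regularization can be checked on a toy problem. The sketch below assumes a one-parameter model y = w·x with Gaussian noise and a zero-mean Gaussian prior on w; the data points and variances are made up. Under these assumptions the prior turns into an L2 penalty with strength noise_var / prior_var.

```python
def mle_slope(xs, ys):
    """Least-squares slope of y = w * x, which is the MLE under Gaussian noise."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def map_slope(xs, ys, noise_var=1.0, prior_var=1.0):
    """MAP slope with a zero-mean Gaussian prior N(0, prior_var) on w."""
    penalty = noise_var / prior_var  # acts like an L2 regularization strength
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)


def neg_log_posterior(w, xs, ys, noise_var=1.0, prior_var=1.0):
    """Negative log posterior of w, up to an additive constant."""
    data_term = sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / (2.0 * noise_var)
    prior_term = w * w / (2.0 * prior_var)
    return data_term + prior_term


# Hypothetical data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]
w_mle = mle_slope(xs, ys)
w_map = map_slope(xs, ys)
print(w_mle, w_map)
```

The MAP estimate is pulled toward zero compared with the MLE, just as weight decay would do, and a broader prior (larger `prior_var`) weakens that pull.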

Bayesian Deep Learning vs Deterministic Deep Learning

Let us first go over how ordinary Deep Learning works. The process is as follows.

  • Optimize the parameter W with the training data.
  • Once W is fixed, the prediction for y is fixed too.

This means that we found W by completely trusting all the training data.

You can think of this as very similar to MLE, which maximizes P(Y|X, W). We can also solve MAP by adding the term P(W), but W is still fixed, because we simply solve for the single W that maximizes the posterior.

This is why we call this Deterministic Deep Learning.

So what is the difference compared to Bayesian Deep Learning? Maybe you can get a rough idea by looking at the picture below. The left side shows the common Deep Learning method and the right side shows the Bayesian one.

We can see that the Bayesian network doesn’t fix W but fixes the distribution of W. If we think of W as the value on a die, every time we roll the die, a different W pops out. This means that every time we roll the die, a different Neural Network with a different W pops out!!

If you have some knowledge of deep learning, you might think this sounds similar to Model Ensembling or Dropout.

This is also true.

Back to the dice metaphor: the output now depends on the value of the die, which is random. If we can formulate an equation for this random output, we might be able to solve for, or at least estimate, the range of the output.

We now need P(W|D), which is the distribution of W given the dataset. Our goal is to know everything about W, which can be done by solving for the Posterior.

Huh? Didn’t we get the P(W|D) value from MAP? NOPE! We only found the W which maximizes P(W|D).
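The dice metaphor can be sketched with a one-weight “network” y = W·x, where W is drawn from a Gaussian each time. The Gaussian form and its mean and standard deviation are illustrative assumptions, standing in for whatever distribution over W has been learned.

```python
import math
import random


def sample_predictions(x, w_mean, w_std, num_samples=1000, seed=0):
    """Roll the die num_samples times: draw W, then predict y = W * x."""
    rng = random.Random(seed)
    return [rng.gauss(w_mean, w_std) * x for _ in range(num_samples)]


def summarize(samples):
    """Mean and standard deviation of the sampled outputs."""
    mean = sum(samples) / len(samples)
    variance = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, math.sqrt(variance)


# Bayesian view: W has a distribution, so the output has one too.
bayes_mean, bayes_std = summarize(sample_predictions(2.0, w_mean=1.5, w_std=0.2))
# Deterministic view: W is a single fixed value, so the output never changes.
fixed_mean, fixed_std = summarize(sample_predictions(2.0, w_mean=1.5, w_std=0.0))
print(bayes_mean, bayes_std)
print(fixed_mean, fixed_std)
```

The spread of the sampled outputs is what gives us a range for the prediction instead of a single number.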

Why can’t we solve for P(W|D)? Because the Marginal Likelihood requires integrating over all possible W, and this is very, very hard to compute!!

This is not easy to derive, so we use Variational Inference as an approximation.
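In symbols, the troublesome term is the denominator of Bayes’ rule, which integrates over every possible W. Variational Inference sidesteps it by picking a simpler distribution and making it as close as possible to the true posterior. The notation q_θ(W) for that simpler distribution is introduced here for illustration.

$$
P(D) = \int P(D \mid W)\,P(W)\,dW,
\qquad
\theta^{*} = \arg\min_{\theta}\; \mathrm{KL}\!\left(q_{\theta}(W)\,\|\,P(W \mid D)\right)
$$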

Besides P(W|D), we also need to solve for P(y|x).

By the Conditional Expectation equation below, we can find the distribution of the output. Note that we are not solving for a fixed output; we are trying to get the distribution of the output.

But this is not easy either; it requires sampling or some additional approximation. I will try to introduce one of the techniques for this.
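One standard way to write this is to average the network’s prediction over the posterior of W, and then approximate the integral with T sampled weights. The primed symbols denote a new input and output, and the sampling approximation on the right is a sketch that assumes we can draw W_t from the posterior or from an approximation of it.

$$
P(y' \mid x', D) = \int P(y' \mid x', W)\,P(W \mid D)\,dW
\;\approx\; \frac{1}{T}\sum_{t=1}^{T} P(y' \mid x', W_t),
\qquad W_t \sim P(W \mid D)
$$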

TLDR

  • Machine Learning’s goal is to find W and predict Y
  • Non-Bayesian Deep Learning finds a fixed W value and predicts a fixed Y
  • Bayesian Deep Learning’s goal is to get the distribution of W (P(W|D)) while also predicting the distribution of Y (P(Y’|X’, W))

For future work, I will cover

  • So how on earth do we get the probability P(W|D)?
  • How do we get P(Y’|X’, W)?
  • How do we get the uncertainty, and where can we use it??
