AND Logic Gate - Importance of bias units

Previously, we talked about simple OR gates and now we’ll continue that discussion with AND gates, and specifically the role of bias units. We often neglect to consider the role bias plays in our models. We know that we should include bias units, but why? Here, I’ll walk through a short example using an AND gate to highlight the importance of the bias unit.

Bias units allow us to offset the model in the same way that an intercept allows us to offset a regression line.

Imagine a simple AND gate. It will only fire if both inputs are true:

Input 1	Input 2	Output
0	0	0
0	1	0
1	0	0
1	1	1

This relationship can be visualized like this:

As you can see from the plot, the linear separator cannot both cross the origin (0,0) and correctly split the categories, so we need to add a bias unit to offset the model.

What would happen if we omitted the bias unit? The following code creates a logistic regression model without bias (2 inputs, 1 output, sigmoid activation):

# Set inputs and correct output values
inputs = [[0,0], [1,1], [0,1], [1,0]]
outputs = [0, 1, 0, 0]

# Set training parameters
alpha = 0.1 # Learning rate
training_iterations = 30000

# Define tensors
x = T.matrix("x")
y = T.vector("y")

# Set random seed
rng = np.random.RandomState(2345)

# Initialize random weights
w_values = np.asarray(rng.uniform(low=-1, high=1, size=(2, 1)),
                      dtype=theano.config.floatX)
w = theano.shared(value=w_values, name='w', borrow=True)

# Theano symbolic expressions
hypothesis = T.nnet.sigmoid(T.dot(x, w)) # Sigmoid activation
hypothesis = T.flatten(hypothesis) # This needs to be flattened
                                   # so hypothesis (matrix) and
                                   # y (vector) have same shape

cost = T.nnet.binary_crossentropy(hypothesis, y).mean()
updates_rules = [
    (w, w - alpha * T.grad(cost, wrt=w))
]

# Theano compiled functions
train = theano.function(inputs=[x, y], outputs=[hypothesis, cost],
                        updates=updates_rules)
predict = theano.function(inputs=[x], outputs=[hypothesis])

# Training
cost_history = []

for i in range(training_iterations):
    h, cost = train(inputs, outputs)
    cost_history.append(cost)

# Plot training curve
plt.plot(range(1, len(cost_history)+1), cost_history)
plt.grid(True)
plt.xlim(1, len(cost_history))
plt.ylim(0, max(cost_history)+0.1)
plt.title("Training Curve")
plt.xlabel("Iteration #")
plt.ylabel("Cost")

The training curve below shows that gradient descent is not able to converge on appropriate parameter values due to the lack of a bias unit:

As a result, the model makes incorrect predictions of 0.5 for the test data:

# Predictions
test_data = [[1,1], [1,1], [1,1], [1,0]]
predictions = predict(test_data)
print predictions

[0.5, 0.5, 0.5, 0.5]

The correct model for an AND gate must include a bias unit to offset the separator, and can be implemented as follows:

# Set inputs and correct output values
inputs = [[0,0], [1,1], [0,1], [1,0]]
outputs = [0, 1, 0, 0]

# Set training parameters
alpha = 0.1 # Learning rate
training_iterations = 30000

# Define tensors
x = T.matrix("x")
y = T.vector("y")
b = theano.shared(value=1.0, name='b') # Add bias unit

# Set random seed
rng = np.random.RandomState(2345)

# Initialize random weights
w_values = np.asarray(rng.uniform(low=-1, high=1, size=(2, 1)),
                      dtype=theano.config.floatX)
w = theano.shared(value=w_values, name='w', borrow=True)

# Theano symbolic expressions
hypothesis = T.nnet.sigmoid(T.dot(x, w) + b) # Sigmoid activation
hypothesis = T.flatten(hypothesis) # This needs to be flattened
                                   # so hypothesis (matrix) and
                                   # y (vector) have same shape

cost = T.nnet.binary_crossentropy(hypothesis, y).mean()
updates_rules = [
    (w, w - alpha * T.grad(cost, wrt=w)),
    (b, b - alpha * T.grad(cost, wrt=b))
]

# Theano compiled functions
train = theano.function(inputs=[x, y], outputs=[hypothesis, cost],
                        updates=updates_rules)
predict = theano.function(inputs=[x], outputs=[hypothesis])

# Training
cost_history = []

for i in range(training_iterations):
    h, cost = train(inputs, outputs)
    cost_history.append(cost)

# Plot training curve
plt.plot(range(1, len(cost_history)+1), cost_history)
plt.grid(True)
plt.xlim(1, len(cost_history))
plt.ylim(0, max(cost_history))
plt.title("Training Curve")
plt.xlabel("Iteration #")
plt.ylabel("Cost")

The training curve shows that gradient descent now converges correctly:

And we get correct predicted values:

# Predictions
test_data = [[1,1], [1,1], [1,1], [1,0]]
predictions = predict(test_data)
print predictions

[0.99999995, 0.99999995, 0.99999995, 0.00001523]

Bias units are typically added to machine learning models, and AND gates are a simple way of highlighting why they are important.

The full code can be found in my GitHub repo here (no bias) and here (with bias).