Tuesday 28 June 2022

Creating Custom Obstacle Detection System Using TensorFlow and TensorFlow Lite for Mobile Robot


This post is a continuation of a project I initiated in 2020, 'Implementing custom CNN with DIY machine vision module'. In that post I created a small convolutional neural network (CNN) that analyzes a grayscale image and determines whether there is an object in the lower half of the image. The system serves as vision-based obstacle detection for small robots. At that time (2020), small single-board computers such as the Raspberry Pi Zero did not have sufficient computing power to run TensorFlow Lite and the popular machine vision package OpenCV at a useful frame rate. However, with the arrival of the Raspberry Pi Zero 2 in October 2021, which uses a quad-core ARM Cortex-A53 processor (as opposed to the single core in the Raspberry Pi Zero), it is now possible to run a convolutional neural network on the Pi Zero 2 at 5 frames-per-second or higher. This article describes the steps to export the CNN from my 2020 post to TensorFlow Lite (TF Lite for brevity from now on) and deploy it on a Raspberry Pi Zero 2 or similar embedded computer (Pi 4B, Jetson, etc.). Note that everything you need can be found on tensorflow.org and other online resources; I merely summarize the essential steps in this post.

There are a number of steps we need to take care of to make our CNN run on an embedded computer:

1. Convert the CNN model into TF Lite format so that it uses fewer resources and runs faster, at the expense of a slight impact on accuracy. Save the TF Lite model to the computer hard disk.

2. Install the TFLite runtime (a bare-minimum library to perform inference with the neural network).

3. Set up a camera and code to acquire image frames from the camera. For this we can use the OpenCV library or the Pygame library.

4. Perform the necessary pre-processing on each image frame, for instance rescaling, cropping, normalizing and converting it into a tensor or NumPy array.

5. Feed the input to the TF Lite runtime and access the output.

The details of the above steps are described below. Here is a video showing how the system performs; in it I am using a Raspberry Pi 3A+, which is already hard to find in 2022 and has slightly higher computing power than the Raspberry Pi Zero 2.

Video 1 - Demonstration of the system.



1. Converting TensorFlow model to TFLite Format

Suppose we have already trained our neural network to sufficient accuracy. Before we can convert our TensorFlow model to TF Lite format, we need to save the model to the computer hard disk. At the time of writing we can save our CNN in 3 formats:

  • The older H5 format.
  • High-level TensorFlow SavedModel format (used by Keras).
  • Low-level TensorFlow SavedModel format (used by TensorFlow API).

Things like the model architecture, weights, compilation info, optimizer settings and state will be saved (we can also select a subset of these). Here we will be using the SavedModel formats (both low and high level), as recommended by TensorFlow. The Python code below shows how this is done for the SavedModel format, assuming model is our TensorFlow model.

# Save model in Keras (high-level) SavedModel format:
model.save("./Exported_model_keras/")

# Save model in tensorflow (low-level) SavedModel format:
tf.saved_model.save(model,"./Exported_model_tf")

Once we have saved the TensorFlow model to hard disk, the following Python script converts the model from TensorFlow to TF Lite. At the time of writing, the TF Lite converter requires that the original TensorFlow model be saved in the low-level SavedModel format.

import os
import tensorflow as tf

SAVED_MODEL_DIR2 = './Exported_model_keras'   # Points to the Keras (high-level) SavedModel folder.
SAVED_MODEL_DIR = './Exported_model_tf'       # Points to the TensorFlow (low-level) SavedModel folder.
EXPORT_MODEL_DIR = './TfLite_model'           # Directory to store the TensorFlow Lite model.
# Path and filename of the exported model.
PATH_TO_EXPORT_MODEL = os.path.join(EXPORT_MODEL_DIR,'model.tflite')
 
mymodel = tf.keras.models.load_model(SAVED_MODEL_DIR2) # NOTE: 14/6/2022 Somehow I kept getting
                                                       # error from python interpreter when this
                                                       # line is not executed.
converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL_DIR)
 
tflite_model = converter.convert()

# Save the model.
with open(PATH_TO_EXPORT_MODEL, 'wb') as f:
  f.write(tflite_model)

In the code above, once converted, the TF Lite model is saved as a file named model.tflite at the following path:

./TfLite_model/model.tflite
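Optionally, the converter can also apply post-training quantization to shrink the model further. The snippet below, which continues from the conversion script above, is only a sketch of the dynamic-range quantization option; it is not used in the rest of this post, the file name model_quant.tflite is just an example, and any accuracy impact should be re-checked after quantization.

# Optional: dynamic-range quantization during conversion (sketch only).
converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL_DIR)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()

with open(os.path.join(EXPORT_MODEL_DIR, 'model_quant.tflite'), 'wb') as f:
    f.write(tflite_quant_model)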

2. Installing TF Lite Runtime on Raspberry Pi

I have tested this on Raspberry Pi OS Buster and Bullseye. The official TensorFlow guides for TF Lite can be found at:
https://www.tensorflow.org/lite
https://www.tensorflow.org/lite/guide/python

To install the TF Lite runtime on Raspberry Pi, we should update our Raspberry Pi OS to the latest version via the terminal:

sudo apt update

sudo apt full-upgrade 

After that we just type the following command in the terminal:

python3 -m pip install tflite-runtime

There is a difference between the typical sudo apt update and sudo apt full-upgrade, which is explained here. In the case of Raspbian Buster, I discovered that an older version of the TF Lite runtime would be installed if I only used sudo apt update. Unfortunately, at the time of writing, there is no way to determine the version of the TF Lite runtime except by uninstalling it and observing the message generated!
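One possible workaround, which I have not verified on every setup: if the runtime was installed through pip as shown above, pip itself can usually report the installed package version from its metadata by typing the following in the terminal:

python3 -m pip show tflite-runtime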

3. Loading the TF Lite Model

The Python code below illustrates how to load the TF Lite model. Here we assume that the TF Lite model has been successfully converted from the TensorFlow model and saved as model.tflite in the folder ./TfLite_model. The input and output specifications of the TF Lite model are contained in the object my_signature. Thus, after loading the TF Lite interpreter, we get the signature of the model.

import os
import tflite_runtime.interpreter as tflite       # Use the TF Lite runtime instead of TensorFlow.

TFLITE_MODEL_DIR = './TfLite_model'
PATH_TO_TFLITE_MODEL = os.path.join(TFLITE_MODEL_DIR,'model.tflite')

interpreter = tflite.Interpreter(PATH_TO_TFLITE_MODEL)

# There is only 1 signature defined in the model, so it will return it by default.
# If there are multiple signatures then we can pass the name.
my_signature = interpreter.get_signature_runner()

# Optional, show the format for input.
input_details = interpreter.get_input_details()
# input_details is a list with one dictionary per input tensor of this neural network.
print(input_details[0])
print(input_details[0]['shape'])
# Now print the signature input and output names so that we can call these later.
print(interpreter.get_signature_list())
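Before pointing a camera at the model, it can be useful to run the signature once on a dummy input to confirm that the tensor shape and the signature names line up. The sketch below assumes the input name conv2d_input and the output name dense_1, which are what get_signature_list() reports for this particular model; they will differ for a model built with other layer names.

# Quick sanity check: run the signature on an all-zero input of the expected
# shape (e.g. [1, 37, 100, 1] for this model) and print the raw output values.
import numpy as np

dummy_input = np.zeros(input_details[0]['shape'], dtype=np.float32)
dummy_output = my_signature(conv2d_input=dummy_input)   # Input name from get_signature_list().
print(dummy_output['dense_1'])                          # Output name from get_signature_list().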

4. Full Example With OpenCV

The code below shows the full implementation using the TF Lite model. Here we use the OpenCV library to manage the interface to the camera. This code can be used either on a computer with the TensorFlow framework or on a Raspberry Pi with the TF Lite runtime; just uncomment the relevant code sections. Of course, both the computer and the Raspberry Pi must have the OpenCV library installed in the Python environment. OpenCV version 4.1.0 or newer is suitable.

import os
import numpy as np
import cv2

TFLITE_MODEL_DIR = './TfLite_model'
PATH_TO_TFLITE_MODEL = os.path.join(TFLITE_MODEL_DIR,'model.tflite')

_SHOW_COLOR_IMAGE = False

#Set the width and height of the input image in pixels.
_imgwidth = 160
_imgheight = 120

#Set the region of interest start point and size.
#Note: The coordinate (0,0) starts at top left hand corner of the image frame.
_roi_startx = 30
_roi_starty = 71
_roi_width = 100
_roi_height = 37


# --- For PC --- [Uncomment as necessary]
import tensorflow as tf
interpreter = tf.lite.Interpreter(PATH_TO_TFLITE_MODEL) # Load the TFLite model in TFLite Interpreter
 

# --- For Raspberry Pi ---
#import tflite_runtime.interpreter as tflite # Use tflite runtime instead of TensorFlow.
#interpreter = tflite.Interpreter(PATH_TO_TFLITE_MODEL)

# There is only 1 signature defined in the model,
# so it will return it by default.
# If there are multiple signatures then we can pass the name.
my_signature = interpreter.get_signature_runner()

# Optional, show the format for input.
input_details = interpreter.get_input_details()
# input_details is a list with one dictionary per input tensor
# of this neural network.
print(input_details[0])
print(input_details[0]['shape'])
# Now print the signature input and output names.
print(interpreter.get_signature_list())

video = cv2.VideoCapture(0) # Open a camera connected to the computer.
video.set(3,2*_imgwidth)   # Property 3 is cv2.CAP_PROP_FRAME_WIDTH: set the camera output width.
video.set(4,2*_imgheight)  # Property 4 is cv2.CAP_PROP_FRAME_HEIGHT: set the camera output height.

# Calculate the corners for all rectangles that we are going to draw on the image.
pointROIrec1 = (2*_roi_startx,2*_roi_starty)
pointROIrec2 = (2*(_roi_startx + _roi_width),2*(_roi_starty + _roi_height))

interval = np.floor(_roi_width/3)
interval2 = np.floor(2*_roi_width/3)
# Rectangle for label1 (object on left)
pointL1rec1 = (2*(_roi_startx+4),2*(_roi_starty+4))
pointL1rec2 = (2*(_roi_startx +int(interval)-4),2*(_roi_starty + _roi_height-4))
# Rectangle for label2 (object on right)
pointL2rec1 = (2*(_roi_startx+4+int(interval2)),2*(_roi_starty+4))
pointL2rec2 = (2*(_roi_startx+_roi_width-4),2*(_roi_starty+_roi_height-4))
# Rectangle for label3 (object in front)
pointL3rec1 = (2*(_roi_startx+4+int(interval)),2*(_roi_starty+4))
pointL3rec2 = (2*(_roi_startx+int(interval2)-4),2*(_roi_starty+_roi_height-4))
# Rectangle for label4 (object blocking front)
pointL4rec1 = (2*(_roi_startx+4),2*(_roi_starty+4))
pointL4rec2 = (2*(_roi_startx + _roi_width-4),2*(_roi_starty + _roi_height-4))

print(pointL1rec1,pointL1rec2)      # Debug: show the corners of the label-1 rectangle.
if not video.isOpened():            # Check if video source is available.
    print("Cannot open camera or file")
    exit()
    
while True:                         # This is same as while (1) in C.
    successFlag, img = video.read() # Read 1 image frame from video.
    
    if not successFlag:             # Check if image frame is correctly read.
        print("Can't receive frame (stream end?). Exiting ...")    
        break
    
    imggray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)         # Convert to grayscale.
    imggrayresize = cv2.resize(imggray,None,fx=0.5,fy=0.5)  # Resize to 160x120 pixels
    
    # Crop out region-of-interest (ROI)
    imggrayresizecrop = imggrayresize[_roi_starty:_roi_starty+_roi_height,_roi_startx:_roi_startx+_roi_width]

    # Normalize each pixel value to a floating point number between 0.0 and +1.0.
    # NOTE: This must follow the same normalization (mean and standard deviation)
    # used when training the TF model; refer to the model pipeline.
    # In TensorFlow the normalization is done by the detection_model.preprocess(image)
    # method; in TensorFlow Lite we have to do it explicitly.
    imggrayresizecropnorm = imggrayresizecrop/256.0                 # Normalize to 32-bit floating point.


    #test = np.expand_dims(imggrayresizecropnorm,(0,-1)) # Change the shape from (37,100) to (1,37,100,1)
                                              # to meet the tflite interpreter input format.
                                              # The datatype must be float32, see
                                              # the output of print(input_details[0]).
    # --- Method 1 using tf.convert_to_tensor to make a tensor from the numpy array ---
    #input_tensor = tf.convert_to_tensor(test, dtype=tf.float32)

    # --- Method 2 to prepare the input, only using numpy ---
    input_tensor = np.asarray(np.expand_dims(imggrayresizecropnorm,(0,-1)), dtype = np.float32)

    output = my_signature(conv2d_input = input_tensor)  # Perform inference on the input. The input and
                                                    # output names can be obtained from
                                                    # interpreter.get_signature_list().

    output1 = np.squeeze(output['dense_1'])         # Remove 1 dimension from the output. The outputs
                                                    # are packed into a dictionary, with the key
                                                    # 'dense_1' selecting the output layer.
    result = np.argmax(output1)

    if _SHOW_COLOR_IMAGE == True:
         # Draw ROI border on image
        cv2.rectangle(img,pointROIrec1,pointROIrec2,(255,0,0), thickness=2)    
        # Draw rectangle for Label 1 to 4 in ROI    
        if result == 1:
            cv2.rectangle(img,pointL1rec1,pointL1rec2,(255,255,0), thickness=2)
        elif result == 2:
            cv2.rectangle(img,pointL2rec1,pointL2rec2,(255,255,0), thickness=2)
        elif result == 3:
            cv2.rectangle(img,pointL3rec1,pointL3rec2,(255,255,0), thickness=2)
        elif result == 4:
            cv2.rectangle(img,pointL4rec1,pointL4rec2,(255,255,0), thickness=2)       
            
        cv2.imshow("Video",img)           # Display the image frame.
    else:
        # Draw ROI border on image
        cv2.rectangle(imggray,pointROIrec1,pointROIrec2,255, thickness=2)    
        # Draw rectangle for Label 1 to 4 in ROI    
        if result == 1:
            cv2.rectangle(imggray,pointL1rec1,pointL1rec2,255, thickness=2)
        elif result == 2:
            cv2.rectangle(imggray,pointL2rec1,pointL2rec2,255, thickness=2)
        elif result == 3:
            cv2.rectangle(imggray,pointL3rec1,pointL3rec2,255, thickness=2)
        elif result == 4:
            cv2.rectangle(imggray,pointL4rec1,pointL4rec2,255, thickness=2)
    
        cv2.imshow("Video",imggray)           # Display the image frame.

    if cv2.waitKey(1) & 0xFF == ord('q'): # waitKey(1) waits up to 1 msec for a key press;
          break                           # ord('q') returns the Unicode code point of 'q',
                                          # so we exit the loop when the user presses 'q'.
# When everything done, release the capture resources.
video.release()                                          
cv2.destroyAllWindows()
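To check whether the target frame rate (around 5 fps or better on a Pi Zero 2, as mentioned in the introduction) is actually achieved, a simple running frame-rate meter can be wrapped around the main loop. The helper below is only an illustrative sketch; the class name and usage are my own and are not part of the original code.

# Illustrative frame-rate meter (not part of the original code).
import time

class FpsMeter:
    def __init__(self):
        self.t_start = time.time()
        self.frames = 0

    def tick(self):
        # Call once per processed frame; returns the average frames-per-second so far.
        self.frames += 1
        elapsed = time.time() - self.t_start
        return self.frames / elapsed if elapsed > 0 else 0.0

# Usage sketch: create fps = FpsMeter() before the while-loop, then inside the loop,
# after cv2.imshow(), call print("Average FPS: %.1f" % fps.tick()) every few frames.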

A demonstration of the code in action is shown in Video 2 below. Note that the original image is in color and 640x480 pixels. The algorithm resizes each image frame to 160x120 pixels, then converts the color channels to grayscale. A subset of this grayscale image, as delineated by the variables

_roi_startx = 30
_roi_starty = 71
_roi_width = 100
_roi_height = 37

sets the region-of-interest (ROI) where the analysis is carried out by the neural network. The ROI location and size can be adjusted, provided we retrain the network every time we adjust the ROI parameters. We can display the image captured by the camera in color or grayscale by setting the variable _SHOW_COLOR_IMAGE to True or False in the code. Figure 1 below shows the file structure of the system as set up on the Raspberry Pi.

Figure 1 - Our TF Lite model is stored in the sub-folder "TfLite_model", while the Python code that uses the TF Lite model to perform inference on the camera images resides in the current folder.

Video 2 - Using OpenCV as the frontend interface to capture camera images.

5. Full Example with Pygame

In addition to using the OpenCV library to interface with the Raspberry Pi on-board camera, we can also use other Python libraries for this purpose. An alternative Python library that I found suitable to replace OpenCV is Pygame. Installing Pygame is just a single instruction in the terminal (see below). So here is another version of the TF Lite implementation with Pygame as the camera interface.
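For reference, the usual installation commands are as follows (pick whichever matches your Python environment; both are standard commands, not specific to this project):

python3 -m pip install pygame

or, using the distribution package on Raspberry Pi OS:

sudo apt install python3-pygame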

 
import os
import numpy as np
import pygame
from pygame import camera
from pygame import display

TFLITE_MODEL_DIR = './TfLite_model'
PATH_TO_TFLITE_MODEL = os.path.join(TFLITE_MODEL_DIR,'model.tflite')

# Original image size
_imgwidth_ori = 640
_imgheight_ori = 480
#_imgwidth_ori = 320
#_imgheight_ori = 240

# Set the width and height of the input image in pixels for tensorflow pipeline.
_imgwidth = 160
_imgheight = 120

# Set the region of interest start point and size.
# Note: The coordinate (0,0) starts at top left hand corner of the image frame.
_roi_startx = 30
_roi_starty = 71
_roi_width = 100
_roi_height = 37


pygame.init() # This initializes pygame, including the display module.
camera.init()
mycam = camera.Camera(camera.list_cameras()[0],(_imgwidth_ori,_imgheight_ori),'HSV')
mycam.start()
screen = display.set_mode((_imgwidth_ori,_imgheight_ori))
display.set_caption("cam")

rescale_level = int(_imgwidth_ori/_imgwidth) # Scale to reduce the image size.
nrow = int(_imgheight_ori/rescale_level)
ncol = int(_imgwidth_ori/rescale_level)
averaging_coeff = 1.0/rescale_level

Ave_row = np.zeros((nrow,_imgheight_ori),dtype=float)
Ave_col = np.zeros((_imgwidth_ori,ncol),dtype=float)
for row in range(nrow):
    for index in range(rescale_level):
        Ave_row[row,rescale_level*row+index] = averaging_coeff
   
for col in range(ncol):
    for index in range(rescale_level):
        Ave_col[rescale_level*col+index,col] = averaging_coeff

# Codes to calculate the coordinates for rectangles and other structures
# that will be superimposed on the display screen as user feedback.
pointROIstart = (rescale_level*_roi_startx,rescale_level*_roi_starty)
pointROIsize = (rescale_level*_roi_width,rescale_level*_roi_height)
pgrectROI = pygame.Rect(pointROIstart,pointROIsize)

interval = np.floor(_roi_width/3)
interval2 = np.floor(2*_roi_width/3)
# Rectangle for label1 (object on left)
pointL1start = (rescale_level*(_roi_startx+4),rescale_level*(_roi_starty+4))
pointL1size = (rescale_level*(int(interval)-8),rescale_level*(_roi_height-8))
pgrectL1 = pygame.Rect(pointL1start,pointL1size)
# Rectangle for label2 (object on right)
pointL2start = (rescale_level*(_roi_startx+4+int(interval2)),rescale_level*(_roi_starty+4))
pointL2size = (rescale_level*(int(interval)-8),rescale_level*(_roi_height-8))
pgrectL2 = pygame.Rect(pointL2start,pointL2size)
# Rectangle for label3 (object in front)
pointL3start = (rescale_level*(_roi_startx+4+int(interval)),rescale_level*(_roi_starty+4))
pointL3size = (rescale_level*(int(interval)-8),rescale_level*(_roi_height-8))
pgrectL3 = pygame.Rect(pointL3start,pointL3size)
# Rectangle for label4 (object blocking front)
pointL4start = (rescale_level*(_roi_startx+4),rescale_level*(_roi_starty+4))
pointL4size = (rescale_level*(_roi_width-8),rescale_level*(_roi_height-8))
pgrectL4 = pygame.Rect(pointL4start,pointL4size)

# --- For PC ---
import tensorflow as tf
interpreter = tf.lite.Interpreter(PATH_TO_TFLITE_MODEL) # Load the TFLite model in TFLite Interpreter
# --- For Raspberry Pi ---
#import tflite_runtime.interpreter as tflite # Use tflite runtime instead of TensorFlow.
#interpreter = tflite.Interpreter(PATH_TO_TFLITE_MODEL)

# There is only 1 signature defined in the model,
# so it will return it by default.
# If there are multiple signatures then we can pass the name.
my_signature = interpreter.get_signature_runner()

# Optional, show the format for input.
input_details = interpreter.get_input_details()
# input_details is a list with one dictionary per input tensor
# of this neural network.
print(input_details[0])
print(input_details[0]['shape'])
# Now print the signature input and output names.
print(interpreter.get_signature_list())

        
    
is_running = True
while is_running:
    img = mycam.get_image()                             # Note: get_image() returns a pygame Surface
                                                        # object, a 2D pixel array in which each pixel packs
                                                        # its three 8-bit color components into one 32-bit
                                                        # unsigned integer, e.g. for the default RGB format:
                                                        # pixel_value = (R*65536) + (G*256) + B.
                                                        # Here the camera was opened in 'HSV' mode, so the
                                                        # three channels are H, S and V instead.
    imgnp = np.asarray(pygame.surfarray.array3d(img),dtype=np.uint32)   # Convert the 2D surface into a 3D array,
                                                        # with the last index selecting the color channel.
    imgI = imgnp[:,:,2]                                 # Extract the V (brightness) component.
    
    imgIt = np.transpose(imgI)                          # Transpose the image array to the correct orientation.
    imgIresize = np.matmul(imgIt,Ave_col)               # Perform image resizing using an averaging method.
                                                        # To speed things up, instead of a double for-loop we
                                                        # use numpy matrix multiplication: the image matrix is
    imgIresize = np.matmul(Ave_row,imgIresize)          # multiplied on the right and on the left. This averages
                                                        # along the rows and columns while reducing the width
                                                        # and height of the original image matrix.
    # Crop out region-of-interest (ROI)
    imggrayresizecrop = imgIresize[_roi_starty:_roi_starty+_roi_height,_roi_startx:_roi_startx+_roi_width]
    # Normalize each pixel value to a floating point number between 0.0 and +1.0.
    # NOTE: This must follow the same normalization (mean and standard deviation)
    # used when training the TF model; refer to the model pipeline.
    # In TensorFlow the normalization is done by the detection_model.preprocess(image)
    # method; in TensorFlow Lite we have to do it explicitly.
    imggrayresizecropnorm = imggrayresizecrop/256.0                 # Normalize to 32-bit floating point.
    # --- Method 1 using tf.convert_to_tensor to make a tensor from the numpy array ---
    #input_tensor = tf.convert_to_tensor(np.expand_dims(imggrayresizecropnorm,(0,-1)), dtype=tf.float32)

    # --- Method 2 to prepare the input, only using numpy ---
    input_tensor = np.asarray(np.expand_dims(imggrayresizecropnorm,(0,-1)), dtype = np.float32)

    output = my_signature(conv2d_input = input_tensor)  # Perform inference on the input. The input and
                                                    # output names can be obtained from
                                                    # interpreter.get_signature_list().

    output1 = np.squeeze(output['dense_1'])         # Remove 1 dimension from the output. The outputs
                                                    # are packed into a dictionary, with the key
                                                    # 'dense_1' selecting the output layer.
    result = np.argmax(output1)    
    print(result)
    
    imgnp[:,:,0] = imgI                             # Create a gray-scale image array by duplicating the luminance V
    imgnp[:,:,1] = imgI                             # values on channel 0 and channel 1 of the 3D image array.
    pygame.surfarray.blit_array(screen,imgnp)       # Copy 3D image array to display surface using block transfer.
                              
    # Draw the ROI border on the screen.
    pygame.draw.rect(screen,(0,0,255),pgrectROI,width=rescale_level)
    # Draw rectangle for Label 1 to 4 in ROI    
    if result == 1:
        pygame.draw.rect(screen,(255,255,0),pgrectL1,width=rescale_level)
    elif result == 2:
        pygame.draw.rect(screen,(255,255,0),pgrectL2,width=rescale_level)
    elif result == 3:
        pygame.draw.rect(screen,(255,255,0),pgrectL3,width=rescale_level)
    elif result == 4:
        pygame.draw.rect(screen,(255,255,0),pgrectL4,width=rescale_level)
        
    display.update()                                    # Refresh the display window with the new frame.
    #display.flip()
    for event in pygame.event.get():  # Closing the window generates a QUIT event.
        if event.type == pygame.QUIT:
            is_running = False
mycam.stop()
pygame.quit()

A demonstration of the code in action is shown in Video 3 below. As before, the original image is in color and 640x480 pixels; the algorithm resizes each image frame to 160x120 pixels, converts it to grayscale, and analyzes only the subset delineated by the variables below, which set the region-of-interest (ROI):

_roi_startx = 30
_roi_starty = 71
_roi_width = 100
_roi_height = 37

 
Video 3 - Using Pygame as the frontend interface to capture the camera images. We also set the region-of-interest (ROI) where the analysis is carried out by the neural network.



Sunday 14 June 2020

Implementing Convolutional Neural Network (CNN) with DIY Machine Vision Module

By F. Kung
 
Last Updated: 31 Dec 2021

In this post I will share my journey in implementing a convolutional neural network (CNN) on my DIY machine vision module (MVM). The MVM in question is described in a previous post https://fkeng.blogspot.com/2016/01/machine-vision-module.html. A picture of it is shown below.


The current version of the machine vision hardware comprises a low-resolution VGA CMOS camera paired with an ARM Cortex-M7 micro-controller (MCU), running at a frame rate of around 20 frames-per-second (fps).  More details of the machine vision hardware and software are described in the previous post.  The focus of this post is on the technical details of training a CNN using Google's TensorFlow framework, and porting the CNN model into custom C code (and a bit of Assembly) that runs on the ARM Cortex-M7 MCU in the machine vision module.  This is what is typically called edge Artificial Intelligence (AI) processing, i.e. the AI computation is performed locally on the machine.   Before I proceed, I wish to clarify a few things:
  • This post is not a tutorial on neural networks nor an explanation of how to use Google's TensorFlow machine learning library, so it is assumed the reader is already familiar with these topics.
  • The MVM is intended for use in a mobile robot for navigation or obstacle avoidance purpose.  Hence the CNN in the MVM is used to perform image analysis of each image frame captured by the camera to estimate the position of obstacles.  
  • The Python code for the CNN is developed using the Spyder IDE; however, any Python IDE should be fine.
  • Here I am using the TensorFlow V2.0 library [Update 15 June 2022: also tested to work well with the V2.9.1 library].
  • If you are looking for a simple way to implement a powerful neural network or edge AI-based system, alternatives like ESP32-CAM, Pixy CAM or Open MV would be better.  You can also opt for commercial edge AI processor board, such as Jetson Nano, Google Coral development board or even Raspberry Pi.

1. Introduction

The idea is to have a machine vision module (MVM) attached to a mobile robot, continuously scanning the floor or surface in front of the robot at 20 fps.  For each captured image, the image processor in the MVM runs a CNN forward-propagation routine, trying to identify any object or obstacle in front of the robot.  To reduce the CNN model size, processing time and memory requirement, the following steps are taken:
  • Only the gray scale or luminance output of the image frame is used.
  • Resolution of the image is reduced to QQVGA, or 120x160 pixels.
  • Because the objective is to detect obstacles in front of the mobile robot, only a small portion of the grayscale image, of size 37x100 pixels, is subjected to the CNN analysis.
  • The CNN has 5 output classes: 'Left', 'Front', 'Right', 'Blocked' and 'No Object' (The corresponding label values are 1, 3, 2, 4 and 0).
With the steps above, it is possible for the CNN code in the image processor to complete the analysis of an image frame in less than 50 milliseconds (for a 20 fps frame rate) on my DIY Machine Vision Module (MVM). This idea is shown in Figure 1 below, with Figure 2 illustrating the 5 output classes of the CNN (also summarized in the snippet after this paragraph). The software for the MVM is still a work in progress; in future, when the efficiency of the code is improved, more layers and output classes can be added (for instance a class for an object on both the left and right sub-regions, but not in front). For now I find that these 5 output classes are sufficient for the mobile robot to navigate its environment.
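For reference, the class-to-label mapping above can be written as a small Python dictionary (the dictionary name is arbitrary; only the integer labels appear in the training code later):

# Mapping between classification outcome and the integer label used during training.
CLASS_LABELS = {'No Object': 0, 'Left': 1, 'Right': 2, 'Front': 3, 'Blocked': 4}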


Figure 1 - Screen capture of the MVM monitor software showing the 'perspective' of the MVM mounted on a small mobile robot.  In the figure, the algorithm in the MVM image processor highlights that an object/obstacle is present in the Right sub-region.








Figure 2 - Examples for 'Blocked' (label = 4), 'Front' (label = 3), 'Left' (label = 1), 'Right' (label = 2) and 'No Object' (label = 0) classification outputs.

The following sections discuss the various topics needed to train the CNN model, export the weights and biases, and implement the inference or forward-propagation computation in C/Assembly code.  Here are two short videos showing how the system works:

Video 1 - Demo of the system mounted on a small mobile robot (static).


Video 2 - Demo of the system with the mobile robot moving in autonomous mode.




2. Saving Image from Machine Vision Module (MVM) onto Computer Harddisk

In order to train the CNN using the back-propagation method with the TensorFlow library, one needs to feed the image data and the class labels to the CNN model created with TensorFlow.  Each image is stored as a 2D NumPy array, with each element of the array representing the value of a pixel normalized to between 0.0 and 1.0.  The element value is stored as a 32-bit floating point datatype by default; it is also possible to use a fixed-point datatype since the value is between 0.0 and 1.0.  In the case of a color image we will have three 2D NumPy arrays (one for each RGB channel) per image.  Normally we would not feed 1 image at a time to train the CNN, but a batch of images.  Suppose we have 500 grayscale images of resolution 50x100 pixels to be fed into the CNN model; a multi-dimensional NumPy array of dimension (500,50,100) would need to be created, where the 1st index points to the image number or sequence, and the 2nd and 3rd indices refer to the pixel coordinate in the (x,y) sense.
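As a minimal sketch of that batch layout (the sizes 500, 50 and 100 are just the example numbers from the paragraph above):

import numpy as np

num_images, img_height, img_width = 500, 50, 100      # Example sizes from the text.
train_batch = np.empty((num_images, img_height, img_width), dtype=np.float32)
# train_batch[k] holds the k-th grayscale image; each pixel is normalized to the
# range 0.0 - 1.0 (e.g. the raw 8-bit value divided by 256.0) before training.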

A convenient way to fill up this multi-dimensional NumPy array is to read the image files one by one from the computer hard disk using Python's Matplotlib pyplot plotting library.  The pyplot module contains an imread( ) function that can read a number of image formats into a multi-dimensional NumPy array.  For this project, as the resolution of the image is low, every single pixel is important.  Thus, if we use image compression to reduce the image size on the hard disk, it is important to use a lossless compression approach.  Here I simply store each image in bitmap (BMP) format.  The MVM can be linked to monitor software, where the user can see the image captured by the MVM camera in real time.  The MVM monitor software has a function to export the raw grayscale image in BMP format and save it to the computer hard disk (see Figure 1).  Further information on the bitmap format can be obtained from [1].  The MVM monitor software is written in Visual Basic .NET; the source code in Listing 1 is an example of how to save a grayscale image in 24-bit BMP format.  Here we assume there is a button called ButtonSaveBMP, and a SaveFileDialog object called SaveFileDialog1 has already been instantiated in the MVM monitor software code.

Listing 1 - Visual Basic .NET subroutine to save bitmap file.
Private Sub ButtonSaveBMP_Click(sender As Object, e As EventArgs) Handles ButtonSaveBMP.Click
        Dim nYindex As Integer
        Dim nXindex As Integer
        Dim bytData(0 To (3 * mintImageWidth) - 1) As Byte   ' Size is 3x(no. of pixels per line)
        Dim nPixel As Integer
        Dim bytBITMAPFILEHEADER() As Byte = {&H42, &H4D, 0, 0, 0, 0, 0, 0, 0, 0, &H36, 0, 0, 0} 'Metadata, file header, 14 bytes.
        Dim bytBITMAPINFOHEADER() As Byte = {&H28, 0, 0, 0, &HA0, 0, 0, 0, &H78, 0, 0, 0, &H1, 0, &H18, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} 'Metadata, bitmap info header, 40 bytes.
        'There is also an optional color table metadata for BMP format which we did not use here.  The color table is only needed when BPP is less than 16 bits.


        Try
            mFilePath = TextBoxFileNum.Text

            If mFilePath <> "" Then              'Check if filename is valid.
                SaveFileDialog1.Title = "Save Bitmap File"
                SaveFileDialog1.CheckFileExists = False
                SaveFileDialog1.DefaultExt = "bmp"
                SaveFileDialog1.Filter = "bitmap files (*.bmp)|*.bmp"
                SaveFileDialog1.FileName = mFilePath
                If SaveFileDialog1.ShowDialog() = DialogResult.OK Then
                    mFilePath = SaveFileDialog1.FileName
                End If
            Else
                MessageBox.Show("Filename not valid", "ERROR", MessageBoxButtons.OK)
            End If

            If mFilePath <> "" Then ' Only proceed if filename is valid.
                My.Computer.FileSystem.WriteAllBytes(mFilePath, bytBITMAPFILEHEADER, False) 'False to overwrite the content.
                My.Computer.FileSystem.WriteAllBytes(mFilePath, bytBITMAPINFOHEADER, True) 'True to append to the existing file content.
                For nYindex = 0 To mintImageHeight - 1
                    For nXindex = 0 To mintImageWidth - 1
                        nPixel = 2 * mbytPixel2(nXindex, mintImageHeight - 1 - nYindex) 'The original luminance value is between 0-127,
                        'here we multiply by 2 to normalize it to between
                        '0 to 255.
                        bytData(3 * nXindex) = nPixel        'Construct a grayscale pixel.
                        bytData((3 * nXindex) + 1) = nPixel     'Format is BGR. Make sure total bytes per line is divisible by 4.
                        bytData((3 * nXindex) + 2) = nPixel
                    Next
                    My.Computer.FileSystem.WriteAllBytes(mFilePath, bytData, True) 'True to append to the existing file content.
                Next
            End If


        Catch ex As Exception
            MessageBox.Show("Save file: " & ex.Message, "ERROR", MessageBoxButtons.OK)
        End Try
    End Sub


3. File Organization

All the image files are sorted according to the output class and stored in separate folders.  Figure 3 shows the directory tree for the training images and test images.  Each image file is named with a number; for instance, in Figure 3 we see that in the Right sub-folder under the TrainImage folder, we have bitmap image files 0.bmp, 1.bmp, 2.bmp and so on. Subsequently, in the Python code for the CNN model, we just need to concatenate the path and the filename to create a valid path to each image and import it into a 2D NumPy array.  Listing 2 is an example of how this is done in Python.

Figure 3 - Directory structure for storing the images.

Listing 2 is a Python example of how to load all the bitmap image files in a sub-folder into a multi-dimensional NumPy array.  Here it is assumed that all the bitmap images in the sub-folder are similar in size and that the folder TrainImage is at the same level as the Python code on the hard disk.  Each image is first read into a temporary 2D NumPy array, then cropped to the required size (as determined by the constants _roi_startx, _roi_starty, _roi_width, _roi_height), and transferred for storage into the multi-dimensional NumPy array for training images.

Listing 2 - Python codes to load a series of bitmap images file from computer hard disk to numpy array.
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import os 


#The width and height of the input image in pixels (needed for the cropping below).
_imgwidth = 160
_imgheight = 120

#The start point, width and height of the region-of-interest (ROI) of each image that will be subject to
#analysis by the CNN.
_roi_startx = 30
_roi_starty = 71
_roi_width = 100
_roi_height = 37


train_dir = os.path.join('./TrainImage/Right')  #Create a path to the sub-folder with object on right in training image directory
train_names = os.listdir(train_dir)   #Create a list containing the filenames of all image files in the sub-folder.

train_num_files = len(os.listdir(train_dir))  #Count the number of files in the sub-folder.

#Create an empty 3D array to hold the sequence of 2D image data and 1D array to hold the labels
train_images = np.empty([train_num_files,_roi_height,_roi_width])
train_labels = np.empty([train_num_files])

i = 0
for train_image_file in train_names:  #Training images, no object.
    #Read original BMP image
    image = plt.imread(train_dir+'/'+train_image_file,format = 'BMP') #Can also use os.path.join to create the complete filename.
    #Extract only 1 channel of the RGB data, assign to 2D array
    imgori = image[0:_imgheight,0:_imgwidth,0]
    #Crop the 2D array to only the region of interest
    train_images[i] = imgori[_roi_starty:_roi_starty+_roi_height,_roi_startx:_roi_startx+_roi_width]
    #Fill up the label array (here each class has 5 samples)
    train_labels[i] = 2  #Label value for object on the right.
    i = i+1


4. The Complete CNN Architecture

To keep things simple I opted for a simple 4-layer structure as illustrated in Figure 4 (it could also be interpreted as a 3-layer structure, depending on how we count the 2D max-pooling function).  At the moment I find this structure adequate for my needs; it is also possible to add another dense neural network layer, but that is all I can fit into the bandwidth of the processor in the MVM for 20 fps operation, i.e. a 50 ms interval.  If the frame rate is reduced to 10 fps, we have a 100 ms interval to process each frame and it is possible to add a second 2D convolution layer after the first convolution layer (see the sketch after this paragraph).  The complete Python code, from loading the bitmap images and creating the CNN model using the TensorFlow Keras API up to training the model, is shown in Listing 3.
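As a hedged sketch only (this is not the model actually used in this post, and the size of the extra layer is purely illustrative), the Keras definition in Listing 3 below could be extended with a second convolution layer for the 10 fps case roughly like this:

# Illustrative 10 fps variant: the baseline CNN of Listing 3 plus a second Conv2D layer.
import tensorflow as tf

model_2conv = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(16, (3,3), strides=2, activation='relu',
                           input_shape=(37, 100, 1)),        # Same first layer as Listing 3.
    tf.keras.layers.Conv2D(16, (3,3), activation='relu'),    # Extra layer (illustrative size).
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(35, activation='relu'),
    tf.keras.layers.Dense(5, activation='softmax')
    ])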


Figure 4 - The CNN structure adopted for this project.


Listing 3 - Python codes for instantiating and training the CNN model with TensorFlow Keras API.

import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import os

tf.keras.backend.clear_session()  # For easy reset of notebook state.

#Set the width and height of the input image in pixels.
_imgwidth = 160
_imgheight = 120

#Set the region-of-interest (ROI) start point and size.
#Note: The coordinate (0,0) starts at top left hand corner of the image frame.
_roi_startx = 30
_roi_starty = 71
_roi_width = 100
_roi_height = 37
_layer0_channel = 16  #Number of convolution kernel/filters.

_DNN1_node = 35      #Number of nodes for dense NN layer.
_DNN2_node = 5        #Number of output nodes.

train_dir = os.path.join('./TrainImage/NoObject') #Create a path to the folder for no object in training image directory
train_names = os.listdir(train_dir)   #Create a list containing the filenames of all image files in the directory.
print("Training file names, 'no object': ", train_names)
print("")
train_num_files = len(os.listdir(train_dir))

train_dir2 = os.path.join('./TrainImage/Left') #Create a path to the folder with object on left in training image directory
train_names2 = os.listdir(train_dir2)   #Create a list containing the filenames of all image files in the directory.
print("Training file names, 'With object on left': ", train_names2)
print("")
train_num_files2 = len(os.listdir(train_dir2))

train_dir3 = os.path.join('./TrainImage/Right') #Create a path to the folder with object on right in training image directory
train_names3 = os.listdir(train_dir3)   #Create a list containing the filenames of all image files in the directory.
print("Training file names, 'With object on right': ", train_names3)
print("")
train_num_files3 = len(os.listdir(train_dir3))

train_dir4 = os.path.join('./TrainImage/Front') #Create a path to the folder with object in front in training image directory
train_names4 = os.listdir(train_dir4)   #Create a list containing the filenames of all image files in the directory.
print("Training file names, 'With object in front': ", train_names4)
print("")
train_num_files4 = len(os.listdir(train_dir4))

train_dir5 = os.path.join('./TrainImage/Blocked') #Create a path to the folder with object blocking the front in training image directory
train_names5 = os.listdir(train_dir5)   #Create a list containing the filenames of all image files in the directory.
print("Training file names, 'With object in blocking the front': ", train_names5)
print("")
train_num_files5 = len(os.listdir(train_dir5))

#--- Load training images and attach label ---

#Create an empty 3D array to hold the sequence of 2D image data and
#1D array to hold the labels
train_images = np.empty([train_num_files + train_num_files2 + train_num_files3
                         + train_num_files4 + train_num_files5,_roi_height,_roi_width])
train_labels = np.empty([train_num_files + train_num_files2 + train_num_files3
                         + train_num_files4 + train_num_files5])


#Read BMP file, extract grayscale value, crop and fill into train_images
#Note: This can also be done using keras.image class, specifically the
#keras.image.load_image() and keras.image.img_to_array() methods.
i = 0
for train_image_file in train_names:  #Training images, no object.
    #Read original BMP image
    image = plt.imread(train_dir+'/'+train_image_file,format = 'BMP') #Can also use os.join function to create the complete filename.
    #Extract only 1 channel of the RGB data, assign to 2D array
    imgori = image[0:_imgheight,0:_imgwidth,0]
    #Crop the 2D array to only the region of interest
    train_images[i] = imgori[_roi_starty:_roi_starty+_roi_height,_roi_startx:_roi_startx+_roi_width]
    #Fill up the label array (here each class has 5 samples)
    train_labels[i] = 0  #Label value for no object.
    i = i+1
    #Plot the training images
    #plt.figure(num=i) #1 image frame per figure.
    #plt.imshow(train_images[i-1],cmap='gray')
   
for train_image_file in train_names2:  #Training images, with object on left.
    #Read original BMP image
    image = plt.imread(train_dir2+'/'+train_image_file,format = 'BMP') #Can also use os.join function to create the complete filename.
    #Extract only 1 channel of the RGB data, assign to 2D array
    imgori = image[0:_imgheight,0:_imgwidth,0]
    #Crop the 2D array to only the region of interest
    train_images[i] = imgori[_roi_starty:_roi_starty+_roi_height,_roi_startx:_roi_startx+_roi_width]
    #Fill up the label array (here each class has 5 samples)
    train_labels[i] = 1  #Label value for object on left.
    i = i+1
    #Plot the training images
    #plt.figure(num=i) #1 image frame per figure.
    #plt.imshow(train_images[i-1],cmap='gray')
   

for train_image_file in train_names3:  #Training images, with object on right.
    #Read original BMP image
    image = plt.imread(train_dir3+'/'+train_image_file,format = 'BMP') #Can also use os.join function to create the complete filename.
    #Extract only 1 channel of the RGB data, assign to 2D array
    imgori = image[0:_imgheight,0:_imgwidth,0]
    #Crop the 2D array to only the region of interest
    train_images[i] = imgori[_roi_starty:_roi_starty+_roi_height,_roi_startx:_roi_startx+_roi_width]
    #Fill up the label array (here each class has 5 samples)
    train_labels[i] = 2  #Label value for object on right.
    i = i+1
    #Plot the training images
    #plt.figure(num=i) #1 image frame per figure.
    #plt.imshow(train_images[i-1],cmap='gray')
    
for train_image_file in train_names4:  #Training images, with object in front.
    #Read original BMP image
    image = plt.imread(train_dir4+'/'+train_image_file,format = 'BMP') #Can also use os.join function to create the complete filename.
    #Extract only 1 channel of the RGB data, assign to 2D array
    imgori = image[0:_imgheight,0:_imgwidth,0]
    #Crop the 2D array to only the region of interest
    train_images[i] = imgori[_roi_starty:_roi_starty+_roi_height,_roi_startx:_roi_startx+_roi_width]
    #Fill up the label array (here each class has 5 samples)
    train_labels[i] = 3  #Label value for object in front.
    i = i+1
    #Plot the training images
    #plt.figure(num=i) #1 image frame per figure.
    #plt.imshow(train_images[i-1],cmap='gray')

for train_image_file in train_names5:  #Training images, with object blocking the front.
    #Read original BMP image
    image = plt.imread(train_dir5+'/'+train_image_file,format = 'BMP') #Can also use os.join function to create the complete filename.
    #Extract only 1 channel of the RGB data, assign to 2D array
    imgori = image[0:_imgheight,0:_imgwidth,0]
    #Crop the 2D array to only the region of interest
    train_images[i] = imgori[_roi_starty:_roi_starty+_roi_height,_roi_startx:_roi_startx+_roi_width]
    #Fill up the label array (here each class has 5 samples)
    train_labels[i] = 4  #Label value for object blocking the front.
    i = i+1
    #Plot the training images
    #plt.figure(num=i) #1 image frame per figure.
    #plt.imshow(train_images[i-1],cmap='gray')
   
#--- Load test images and attach label ---
test_dir = os.path.join('./TestImage/NoObject')
test_names = os.listdir(test_dir)
print("Test file names, 'no object': ", test_names)
print("")
test_num_files = len(os.listdir(test_dir))

test_dir2 = os.path.join('./TestImage/Left')
test_names2 = os.listdir(test_dir2)
print("Test file names, 'with object on left': ", test_names2)
print("")
test_num_files2 = len(os.listdir(test_dir2))

test_dir3 = os.path.join('./TestImage/Right')
test_names3 = os.listdir(test_dir3)
print("Test file names, 'with object on right': ", test_names3)
print("")
test_num_files3 = len(os.listdir(test_dir3))

test_dir4 = os.path.join('./TestImage/Front')
test_names4 = os.listdir(test_dir4)
print("Test file names, 'with object in front': ", test_names4)
print("")
test_num_files4 = len(os.listdir(test_dir4))

test_dir5 = os.path.join('./TestImage/Blocked')
test_names5 = os.listdir(test_dir5)
print("Test file names, 'with object blocking the front': ", test_names5)
print("")
test_num_files5 = len(os.listdir(test_dir5))

#Read BMP file, extract grayscale value, crop and fill into train_images

#Create an empty 3D array to hold the sequence of 2D image data and labels

test_images = np.empty([test_num_files + test_num_files2 + test_num_files3 +
                        test_num_files4 + test_num_files5,_roi_height,_roi_width])
test_labels = np.empty([test_num_files + test_num_files2 + test_num_files3 +
                        test_num_files4 + test_num_files5])

i = 0
for test_image_file in test_names:  #Test images, no object.
    #Read original BMP image
    image = plt.imread(test_dir+'/'+test_image_file,format = 'BMP')
    #Extract only 1 channel of the RGB data, assign to 2D array
    imgori = image[0:_imgheight,0:_imgwidth,0]
    #Crop the 2D array to only the region of interest
    test_images[i] = imgori[_roi_starty:_roi_starty+_roi_height,_roi_startx:_roi_startx+_roi_width]
    #Fill up the label array (here each class has 20 samples)
    test_labels[i] = 0  #Label value for no object.
    i = i+1   
    #Plot the test images
    #plt.figure(num=i) #1 frame per figure.
    #plt.imshow(test_images[i-1],cmap='gray')
   
for test_image_file in test_names2:  #Test images, with object on left.
    #Read original BMP image
    image = plt.imread(test_dir2+'/'+test_image_file,format = 'BMP')
    #Extract only 1 channel of the RGB data, assign to 2D array
    imgori = image[0:_imgheight,0:_imgwidth,0]
    #Crop the 2D array to only the region of interest
    test_images[i] = imgori[_roi_starty:_roi_starty+_roi_height,_roi_startx:_roi_startx+_roi_width]
    #Fill up the label array (here each class has 20 samples)
    test_labels[i] = 1  #Label value for with object on left.
    i = i+1   
    #Plot the test images
    #plt.figure(num=i) #1 frame per figure.
    #plt.imshow(test_images[i-1],cmap='gray')

for test_image_file in test_names3:  #Test images, with object.
    #Read original BMP image
    image = plt.imread(test_dir3+'/'+test_image_file,format = 'BMP')
    #Extract only 1 channel of the RGB data, assign to 2D array
    imgori = image[0:_imgheight,0:_imgwidth,0]
    #Crop the 2D array to only the region of interest
    test_images[i] = imgori[_roi_starty:_roi_starty+_roi_height,_roi_startx:_roi_startx+_roi_width]
    #Fill up the label array (here each class has 20 samples)
    test_labels[i] = 2  #Label value for with object on right.
    i = i+1   
    #Plot the test images
    #plt.figure(num=i) #1 frame per figure.
    #plt.imshow(test_images[i-1],cmap='gray')

for test_image_file in test_names4:  #Test images, with object in front.
    #Read original BMP image
    image = plt.imread(test_dir4+'/'+test_image_file,format = 'BMP')
    #Extract only 1 channel of the RGB data, assign to 2D array
    imgori = image[0:_imgheight,0:_imgwidth,0]
    #Crop the 2D array to only the region of interest
    test_images[i] = imgori[_roi_starty:_roi_starty+_roi_height,_roi_startx:_roi_startx+_roi_width]
    #Fill up the label array (here each class has 20 samples)
    test_labels[i] = 3  #Label value for with object in front.
    i = i+1   
    #Plot the test images
    #plt.figure(num=i) #1 frame per figure.
    #plt.imshow(test_images[i-1],cmap='gray')
   
for test_image_file in test_names5:  #Test images, with object blocking the front.
    #Read original BMP image
    image = plt.imread(test_dir5+'/'+test_image_file,format = 'BMP')
    #Extract only 1 channel of the RGB data, assign to 2D array
    imgori = image[0:_imgheight,0:_imgwidth,0]
    #Crop the 2D array to only the region of interest
    test_images[i] = imgori[_roi_starty:_roi_starty+_roi_height,_roi_startx:_roi_startx+_roi_width]
    #Fill up the label array (here each class has 20 samples)
    test_labels[i] = 4  #Label value for object blocking the front.
    i = i+1   
    #Plot the test images
    #plt.figure(num=i) #1 frame per figure.
    #plt.imshow(test_images[i-1],cmap='gray')   

train_images = train_images/256.0 #Normalize the training image array, and
                                  #convert to floating point.
                                  #Alternatively we can use:
#train_images = train_images.astype('float32')/256                                     
train_images=train_images.reshape(train_num_files + train_num_files2 + train_num_files3 +
                                  train_num_files4 + train_num_files5, _roi_height, _roi_width, 1)

test_images = test_images/256.0 #Normalize the test image array.
test_images=test_images.reshape(test_num_files + test_num_files2 + test_num_files3 +
                                test_num_files4 + test_num_files5, _roi_height, _roi_width, 1)

# Model - CNN with single convolution layer and single max-pooling layer, 2 DNN layers.
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(_layer0_channel, (3,3), strides = 2, activation='relu', input_shape=(_roi_height, _roi_width, 1)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(_DNN1_node, activation='relu'),
    tf.keras.layers.Dense(_DNN2_node, activation='softmax')
    ])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()


# Optional, generating a plot of the model. Requires pydot and graphviz to be
# installed. If the path to the graphic file "my_first_model.png" is not given, it
# will be saved to the same folder as this sourcecode.
 

tf.keras.utils.plot_model(model, 'CNN_model.png', show_shapes = True)
history = model.fit(train_images, train_labels, epochs=30)
model.evaluate(test_images, test_labels)

#Try classify a test image:
classifications = model.predict(test_images)
index1 = 3  #Index of the test sample to examine.
print("Classify 1 test sample")
print(classifications[index1]) # Prints the values of the 5 output nodes for the test sample pointed to by index1.
print(test_labels[index1])     # The corresponding ground-truth label, for comparison with the largest output node.


5. Exporting the CNN Weights

Once the neural network model is properly trained, we can use it to classify a test image as shown in the last section of Listing 3, which performs the forward propagation calculation, or inference, when the tensorflow.keras.Model.predict( ) method is called.  However, instead of using the computer to execute the neural network model, I wish to run the neural network on my own processor.  To do this, two pieces of information are needed:
  1. The configuration of the neural network, e.g. what each layer in the neural network contains, and how the layers are connected.
  2. A set of weight values (the "state of the model") for each layer, which specify the coefficient for each path from a node in the current layer to a node in the subsequent layer.
The first piece of information is easily obtained from the declaration of the CNN model. One way to get the second piece is to use the tensorflow.keras.Model.save_weights( ) method [2]. Calling this method saves the weights of all layers to a file, in HDF5 format or TensorFlow's native format; we can then use other software to read the HDF5 file. A more direct approach is to invoke the get_weights( ) method of the tensorflow.keras.layers.Layer class, which outputs the weights of the layer as NumPy arrays. Since each layer in the CNN model inherits this class, the get_weights( ) method can also be accessed directly from our CNN model, as shown in Listing 4.  The return value of get_weights( ) is a list of NumPy arrays containing the weights (coefficients) and biases for each layer. In Listing 4 we also extract the size (e.g. the number of nodes) of each layer, so that the configuration of the CNN can be changed dynamically. All these values can then be saved as a C/C++ header file in text format, so that I can read it into the C/C++ code of my micro-controller firmware.  The second part of Listing 4 achieves this by opening a text file and writing the elements of the NumPy arrays into it using for-loops.

Listing 4 - Python codes accessing the weights of all layers using Tensorflow Keras API and saving the weights into a C style header file.
# Get the weights of all layers.
wt = model.get_weights()
 

# The highest level index to wt points to the coefficients or weights of each layer.
wtConv2D1 = wt[0]        # Weights of 1st convolution layer.
wtConv2D1bias = wt[1]    # Bias of 1st convolution layer.
wtDNN1 = wt[2]          # Weights of 1st DNN layer.
wtDNN1bias = wt[3]      # Bias of 1st DNN layer.
wtDNN2 = wt[4]          # Weights of 2nd DNN layer.
wtDNN2bias = wt[5]          # Bias of 2nd DNN layer.

Conv2D1filter = wtConv2D1.shape[3]  # get no. of filters in 1st convolution layer.
Flattennode = wtDNN1.shape[0]         # get no. of nodes after flattening the convolutional layer.
DNN1node = wtDNN1.shape[1]          # get no. of nodes in 1st DNN layer.
DNN2node = wtDNN2.shape[1]          # get no. of nodes in 2nd DNN layer.


# Open a text file for writing.
f = open(r"C:\CNN.h","w+")     # Header file to store the coefficients.

# Set the parameters of the filter and other constants in the CNN.
f.write("#define  __ROI_STARTX  %d \n" % _roi_startx)
f.write("#define  __ROI_STARTY  %d \n" % _roi_starty)
f.write("#define  __ROI_WIDTH  %d \n" % _roi_width)
f.write("#define  __ROI_HEIGHT  %d \n" % _roi_height)
f.write("#define  __FILTER_SIZE  3 \n")
f.write("#define  __FILTER_STRIDE  2 \n")
f.write("#define  __LAYER0_CHANNEL  %d \n" % _layer0_channel)
f.write("#define  __LAYER0_X  %d \n" % ((_roi_width-3)/2 + 1))
f.write("#define  __LAYER0_Y  %d \n" % ((_roi_height-3)/2 + 1))
f.write("#define  __DNN1NODE %d \n" % _DNN1_node)
f.write("#define  __DNN2NODE %d" % _DNN2_node)
f.write("\n\n")

N = 3   # Filter size, 3x3

f.write("const  int  gnL1f[%d][%d][%d] = { \n" % (Conv2D1filter,N,N))  # Integer version
for nfilter in range(Conv2D1filter):
    f.write("{")
    for i in range(3):
        f.write("{")
        for j in range(3):
            f.write("%d" % (wtConv2D1[i,j,0,nfilter]*1000000))  # Scaled integer version.
            if j < (N-1):
                f.write(", ")       # Add a comma and space after every number, except last number.
        f.write("}")       
        if i < (N-1):
            f.write(", ")
    if nfilter < (Conv2D1filter - 1):
        f.write("}, \n")
    else:
        f.write("} \n")
f.write("}; \n\n")

# Bias for Conv2D1
f.write("const  int  gnL1fbias[%d] = {" % Conv2D1filter)  # Integer version
for nfilter in range(Conv2D1filter):
    f.write("%d" % (wtConv2D1bias[nfilter]*1000000))  # Scaled integer version
    if nfilter < (Conv2D1filter-1):
        f.write(", ")
f.write("}; \n\n")

# DNN layer 1
# Weights
f.write("const  int  gnDNN1w[%d][%d] = { \n" % (Flattennode,DNN1node))      # Integer version.
for i in range(Flattennode):
    f.write("{")
    for j in range(DNN1node):
        f.write("%d" % (wtDNN1[i,j]*1000000)) # Scaled integer version.
        if j < (DNN1node - 1):
            f.write(", ")           # Add a comma and space after every number, except last number.
    if i < (Flattennode - 1):
        f.write("},\n")             # Add a newline and '}' after every row.
    else:
        f.write("} \n")
f.write("}; \n\n")

# Bias
f.write("const int gnDNN1bias[%d] = {" % DNN1node)  # Scaled integer veresion.
for i in range(DNN1node):
    f.write("%d" % (wtDNN1bias[i]*1000000))  #Scaled to integer.
    if i < (DNN1node - 1):
        f.write(", ")
f.write("}; \n\n")

# DNN layer 2
# Weights
f.write("const  int  gnDNN2w[%d][%d] = { \n" % (DNN1node,DNN2node))  # Scaled integer veresion.
for i in range(DNN1node):
    f.write("{")
    for j in range(DNN2node):
        f.write("%d" % (wtDNN2[i,j]*1000000))  # Scaled integer veresion.
        if j < (DNN2node - 1):
            f.write(", ")           # Add a comma and space after every number, except last number.
    if i < (DNN1node - 1):
        f.write("},\n")             # Add a newline and '}' after every row.
    else:
        f.write("} \n")
f.write("}; \n\n")

# Bias
f.write("const int gnDNN2bias[%d] = {" % DNN2node)
for i in range(DNN2node):
    f.write("%d" % (wtDNN2bias[i]*1000000))  # Scaled to integer veresion.
    if i < (DNN2node - 1):
        f.write(", ")
f.write("}; \n\n")

f.close()



An example of the weights returned by wt = model.get_weights( ) and the resulting header file are shown in Figure 5.  Some salient points concerning the header file:
  • Notice that I convert the values of the weights and biases from 32-bit floating point to integer.  This is achieved by multiplying each floating-point value by 1000000, preserving 6 decimal places of the original value.  The computation in the micro-controller is then performed using integer maths to increase the throughput (a small sketch of this fixed-point arithmetic is given after this list).
  • All the weights and biases are declared with the const modifier, which causes the C/C++ compiler to place the values in the non-volatile memory of the micro-controller, i.e. the Flash memory.  During the forward-propagation calculation in the micro-controller, computation is performed layer-by-layer.  Thus, only the weight and bias values pertinent to the current layer are loaded from Flash memory into the micro-controller RAM, reducing the RAM demand on the micro-controller.
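To make the fixed-point convention concrete, below is a minimal, self-contained C sketch (not part of the actual firmware; all variable names are hypothetical) of how a weight scaled by 1,000,000 multiplies a 7-bit luminance value, with the product accumulated in 64 bits and then normalized by 128, mirroring the steps that appear later in Listing 5.

#include <stdio.h>
#include <stdint.h>

#define SCALE   1000000L                            // Same scale factor used when exporting CNN.h.

int main(void)
{
    float   fWeight = -0.123456f;                   // A hypothetical trained weight.
    int32_t nWeight = (int32_t)(fWeight * SCALE);   // As it would appear in CNN.h: -123456.
    int32_t nLuminance = 100;                       // 7-bit luminance value, 0 to 127.

    // Multiply in 64 bits to avoid overflow, then divide by 128 so that the
    // luminance is normalized to the 0 to 1.0 range used during training.
    // The result stays in the x1,000,000 fixed-point domain.
    int64_t nAcc = (int64_t)nWeight * nLuminance;
    int32_t nResult = (int32_t)(nAcc / 128);

    printf("Fixed-point weight = %ld, product (x1e6) = %ld\n",
           (long)nWeight, (long)nResult);
    return 0;
}

In the actual firmware the same normalization is carried out inside the convolution routine shown in Listing 5.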



Figure 5 - Comparing the weights generated by TensorFlow for the CNN model with the content of the exported C/C++ header file (named "CNN.h").


6. Performing the Inference Operation in the Micro-Controller

Once the header file is generated, we can incorporate it into the firmware source code for the micro-controller (MCU) in the machine vision module.  The integer weights or coefficients of the CNN are stored in the Flash memory of the MCU due to the const modifier.  The firmware first loads the required integer weights from Flash memory into RAM, and then uses for-loops to calculate the output of each 2D convolution filter or node.  The complete high-level flow of the system is illustrated in Figure 6.

Figure 6 - High-level view of implementing the CNN routines in the machine vision module.

There are many approaches to implementing the C-language neural network inference subroutine in the MCU of the machine vision module. The approach that I used is shown in Figure 7. The inference subroutine needs to complete its execution within 50 ms to sustain a frame rate of 20 fps. In the actual implementation, other tasks must execute periodically in parallel with the neural network inference subroutine, such as the task that compresses and streams the pixels in the video buffer to an external display, the camera driver, and the image pre-processing routines.  Hence, I break the flow into a series of smaller parts so that the inference routine does not hog the processor.  For example, in Figure 7, after calculating the output of each 2D convolution filter or node, control can be returned to the RTOS or scheduler so that other tasks can run (a sketch of this time-slicing idea is shown after Figure 7).  All in all, from experiment, the other tasks require roughly 10 ms or 20% of the MCU bandwidth within one frame interval, leaving roughly 40 ms for the neural network inference subroutine.


Figure 7 - Detailed flow of the C inference subroutine.
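As an illustration of the time-slicing described above, here is a minimal sketch of a cooperative inference task.  The routine names ComputeLayer0Filter( ), ComputeDNNLayers( ), ReportResult( ) and OSTaskYield( ) are hypothetical stand-ins, not the actual firmware or RTOS API; one convolution filter of the first layer is computed per invocation, and the remaining dense layers are cheap enough to run in a single slice.

// Cooperative inference task: compute one slice, then yield to the scheduler.
void TaskInference(void)
{
    static int nFilter = 0;                 // Next convolution filter to compute.

    if (nFilter < __LAYER0_CHANNEL)
    {
        ComputeLayer0Filter(nFilter);       // One filter's feature map per slice.
        nFilter++;
    }
    else
    {
        ComputeDNNLayers();                 // Dense layers 1 and 2.
        ReportResult();                     // e.g. update the obstacle-detected flag.
        nFilter = 0;                        // Start over on the next video frame.
    }
    OSTaskYield();                          // Return control to the RTOS/scheduler so
}                                           // other periodic tasks can run.

A sketch of ComputeLayer0Filter( ) itself is given after Listing 5.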

To speed things up, here are some of the key features that I used in the C-language neural network inference subroutine:
  • Weights and node values are stored as 32-bit signed integers in RAM.
  • All linear arithmetic operations use 32-bit signed integer operations.  The intermediate result of a multiply-and-accumulate operation is stored in 64-bit signed integer format to prevent overflow.  The final result is then normalized back to a 32-bit signed integer.
  • Some parts of the neural network inference subroutine are implemented with assembly instructions, and the subroutine is declared with the C __INLINE directive.  For example, in Listing 5 the function that calculates the 2D convolution filter output uses the __MLAD assembly instruction of the ARM Cortex-M7 [3], which performs a multiply-and-accumulate operation in a single instruction cycle.
  • Instead of using the SoftMax function at the output as in Figure 7, I simply search for the maximum value in the 3rd layer, essentially implementing a max( ) function, to obtain the output class (see the sketch after this list).  The SoftMax( ) function is needed during training of the network because it is differentiable, allowing it to be used with the back-propagation algorithm.
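A minimal sketch of this arg-max output stage is shown below.  This is not the actual firmware code; gnDNN2out[ ] is a hypothetical array holding the __DNN2NODE node values of the final layer.

// Return the index of the largest node value in the final layer, replacing the
// SoftMax stage.  Since SoftMax is monotonic, the winning class is unchanged.
int nGetOutputClass(const int *gnDNN2out)
{
    int nMaxIndex = 0;

    for (int i = 1; i < __DNN2NODE; i++)
    {
        if (gnDNN2out[i] > gnDNN2out[nMaxIndex])
        {
            nMaxIndex = i;                  // Track the node with the largest value.
        }
    }
    return nMaxIndex;                       // Index of the winning output class.
}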

Listing 5 - C code implementing the 2D convolution operation and activation function on the ARM Cortex-M7.

// Function to compute the 3x3 convolution operation on a small region
// of the image buffer.
// ni, nj = (x,y) coordinate of start pixel in 3x3 image patch.
// nFilA = Address of array containing the 9 coefficients of the 2D convolution kernel or filter.
// nBias = Bias value.
 

__INLINE int    nConv2D(int ni, int nj, int * nFilA, int nBias)
{
    int    nLuminance[9];
    int    nTemp;
  
    if (gnValidFrameBuffer == 1)                    // If the frame buffer valid flag equals 1, read from gunImgAtt2[], else gunImgAtt[].
    {
        nLuminance[0] = gunImgAtt2[ni][nj] & _LUMINANCE_MASK; // Extract the 7-bit luminance value.
        nLuminance[1] = gunImgAtt2[ni+1][nj] & _LUMINANCE_MASK;
        nLuminance[2] = gunImgAtt2[ni+2][nj] & _LUMINANCE_MASK;
        nLuminance[3] = gunImgAtt2[ni][nj+1] & _LUMINANCE_MASK;
        nLuminance[4] = gunImgAtt2[ni+1][nj+1] & _LUMINANCE_MASK;
        nLuminance[5] = gunImgAtt2[ni+2][nj+1] & _LUMINANCE_MASK;
        nLuminance[6] = gunImgAtt2[ni][nj+2] & _LUMINANCE_MASK;
        nLuminance[7] = gunImgAtt2[ni+1][nj+2] & _LUMINANCE_MASK;
        nLuminance[8] = gunImgAtt2[ni+2][nj+2] & _LUMINANCE_MASK;
    }
    else
    {
        nLuminance[0] = gunImgAtt[ni][nj] & _LUMINANCE_MASK; // Extract the 7-bit luminance value.
        nLuminance[1] = gunImgAtt[ni+1][nj] & _LUMINANCE_MASK;
        nLuminance[2] = gunImgAtt[ni+2][nj] & _LUMINANCE_MASK;
        nLuminance[3] = gunImgAtt[ni][nj+1] & _LUMINANCE_MASK;
        nLuminance[4] = gunImgAtt[ni+1][nj+1] & _LUMINANCE_MASK;
        nLuminance[5] = gunImgAtt[ni+2][nj+1] & _LUMINANCE_MASK;
        nLuminance[6] = gunImgAtt[ni][nj+2] & _LUMINANCE_MASK;
        nLuminance[7] = gunImgAtt[ni+1][nj+2] & _LUMINANCE_MASK;
        nLuminance[8] = gunImgAtt[ni+2][nj+2] & _LUMINANCE_MASK;
    }
      
    // Convolution or cross-correlation operation with the 3x3 filter.  We avoid using a for-loop to speed up the computation.
    // Note: 24 April 2020, I have tried a few approaches using C code without for-loops and verified that this method is the
    // fastest, from a few tens of microseconds down to a few microseconds!  This method forces the compiler to use the 32-bit
    // signed integer multiply-and-accumulate assembly instruction of the Cortex-M7 core, making it the most efficient.
              
    nTemp = (*nFilA)*(*nLuminance);                 // Correlation operation.
    nTemp = __MLAD(*(nFilA+1),*(nLuminance+1),nTemp);    // Using assembly multiply and accumulate instruction.
    nTemp = __MLAD(*(nFilA+2),*(nLuminance+2),nTemp);
    nTemp = __MLAD(*(nFilA+3),*(nLuminance+3),nTemp);
    nTemp = __MLAD(*(nFilA+4),*(nLuminance+4),nTemp);
    nTemp = __MLAD(*(nFilA+5),*(nLuminance+5),nTemp);
    nTemp = __MLAD(*(nFilA+6),*(nLuminance+6),nTemp);
    nTemp = __MLAD(*(nFilA+7),*(nLuminance+7),nTemp);
    nTemp = __MLAD(*(nFilA+8),*(nLuminance+8),nTemp);
    nTemp = (nTemp/128) + nBias;                    // Normalize by 128, then add the bias term.  The luminance
                                                    // value ranges from 0 to 127; during training of the CNN we
                                                    // normalize it by 128 to bring it into the range 0 to 1.0,
                                                    // so in inference we do the same.
  
    // ReLu activation function
    if (nTemp < 0)
    {
        nTemp = 0;
    }  
    return nTemp;
}
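
As a usage example, the sketch below shows how nConv2D( ) could be driven by outer loops that slide each 3x3 kernel across the region of interest with the stride defined in CNN.h.  This is not the actual firmware; ComputeLayer0Filter( ) and gnLayer0Out[ ] are hypothetical names, and the kernel coefficients are first staged from Flash (the const arrays in CNN.h) into a small RAM buffer, as described earlier.

#include <string.h>

int gnLayer0Out[__LAYER0_CHANNEL][__LAYER0_Y][__LAYER0_X];  // Hypothetical buffer for the
                                                            // layer-0 feature maps.

void ComputeLayer0Filter(int nFilter)
{
    int nKernel[9];                                         // Stage the 9 kernel coefficients
    memcpy(nKernel, gnL1f[nFilter], sizeof(nKernel));       // from Flash (CNN.h) into RAM.

    for (int y = 0; y < __LAYER0_Y; y++)                    // Every output row...
    {
        for (int x = 0; x < __LAYER0_X; x++)                // ...and every output column.
        {
            int ni = __ROI_STARTX + x*__FILTER_STRIDE;      // Top-left pixel of the
            int nj = __ROI_STARTY + y*__FILTER_STRIDE;      // current 3x3 image patch.
            gnLayer0Out[nFilter][y][x] = nConv2D(ni, nj, nKernel, gnL1fbias[nFilter]);
        }
    }
}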


7. Conclusion and Sample Files

The system is pretty versatile.  As of June 2020, I have collected around 200 images for the 5 classes and can achieve an accuracy between 85% and 92% in actual usage.  I have also reduced the number of output classes to two, and simply used the system to detect whether or not an obstacle is present in front of the machine vision module.  On another interesting note, with the two-class version of the source code, I have used the system to check for the presence or absence of a human face by simply replacing the training images.  Unfortunately, due to the low resolution of the camera and the shallow neural network architecture, the system is not able to differentiate between human faces; it merely detects presence or absence.  The Python code and the sample training and test images can be obtained from the MVM V1.5C github project repository here.

References

1. U. Hiwarale, "Bits to bitmaps: A simple walkthrough of bitmap image format", 2019. https://itnext.io/bits-to-bitmaps-a-simple-walkthrough-of-bmp-image-format-765dc6857393
2. TensorFlow online documentation, March 2020 version. https://www.tensorflow.org/guide
3. ARM Cortex-M7 devices generic user guide, 2015. https://developer.arm.com/documentation/dui0646/b/