The world that we interact with every day is three-dimensional, yet the majority of deep learning models process visual data as 2D images. However, some neural network architectures are capable of processing 3D structures directly. An early approach, called PointNet, was presented at the Conference on Computer Vision and Pattern Recognition (CVPR) in 2017.
This blog post touches on two topics: First, we will have a look at 3D deep learning with PointNet. The creators of PointNet have also released a TensorFlow 1.x implementation on GitHub. Since TensorFlow 2.0 was released on September 30, 2019, we will transform this original implementation into an idiomatic TensorFlow 2 implementation in the second part of the post. You can find all code examples discussed in this blog post on GitHub.
So let’s get started with an overview of the PointNet architecture. If you are not yet familiar with the fundamentals of deep learning systems, check out our blog posts about Deep Learning Fundamentals and the concepts and methods of artificial neural networks.
PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation
Point clouds are an important data structure in 3D computer vision; some of the most important use cases are in autonomous driving and robotics, where 3D information is needed to navigate in and interact with the real world. A point cloud is an unordered set of spatial coordinates, X, Y and Z, without any explicit geometric relationship between points. However, most machine learning models assume a well-defined, structured input, which makes it impossible to feed point clouds directly into them. This led to resampling into other 3D data modalities, such as occupancy or voxel grids, which causes information loss due to the quantization of fine-granular point cloud data. Besides, these techniques are computationally expensive and may lead to a much larger memory footprint than the raw point cloud data.
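To make the “unordered set” property concrete, here is a tiny made-up example:

```python
import numpy as np

# A tiny point cloud: an unordered set of points with X, Y, Z coordinates
points = np.array([[0.1, 0.2, 0.3],
                   [0.5, 0.1, 0.9],
                   [0.4, 0.7, 0.2]])

# Any permutation of the rows describes exactly the same cloud, so a
# model that consumes it directly must be invariant to the point order.
shuffled = np.random.permutation(points)
```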
The PointNet neural network architecture contains special building blocks that remove the need for a resampled input format. We will discuss them in the following sections. But first, have a look at the following graphic, which displays the whole PointNet architecture. All the mlp(p,q,...) layers prior to the max pooling are actually \(1 × 1 \) convolutional layers: they behave like fully connected “vanilla” MLP layers, where each operates on a single point in the point cloud. PointNet computes a global feature vector using a max pooling layer. This makes the model invariant to the number of input points, since the classification is based on only one 1024-dimensional vector. One really nice thing about PointNet is that you can use the same base network for classification and segmentation tasks. The latter is achieved by concatenating the global feature with each point feature and subsequently computing new point features based on this combined feature. This results in per-point features that contain both local geometric information and global semantics. With small modifications to the architecture, the model is also capable of computing object part segmentations. Given a 3D scan, the task of object part segmentation is to assign part category labels (e.g. chair leg, cup handle) to each point.
Now that you know the basic architecture of PointNet, we can have a look at what makes this architecture so special. The invariance against input size and order is accomplished by the application of \(1 × 1 \) convolutions and a symmetric function, which maps an \(N × D \)-dimensional input tensor to one global M-dimensional feature vector. \(1 × 1 \) convolutions (which technically aren’t convolutions at all) are especially useful to alter the number of channels of your input. For example, consider an input image of shape \(n_h × n_w × n_c \). After you apply a convolutional layer with p \(1 × 1 \) filters, your output will be of shape \(n_h × n_w × p \). Another view of a \(1 × 1 \) convolution is that it takes an \(n_c \)-dimensional vector as input and transforms it into a p-dimensional output vector, much like a dense neural network layer does. For point clouds, imagine convolving over the per-point features while increasing or decreasing the number of features, independently of the overall number of points in the point cloud. This leads to a fully convolutional architecture that computes exactly the same per-point features, regardless of the number of input points.
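A minimal sketch of this behavior (the batch and point counts are chosen purely for illustration):

```python
import tensorflow as tf

# A batch of 32 point clouds with 1024 points and 3 coordinates each (B x N x D)
points = tf.random.uniform((32, 1024, 3))

# A 1x1 convolution maps each point's 3 features to 64 features using the
# same weights for every point, independently of all other points.
conv = tf.keras.layers.Conv1D(filters=64, kernel_size=1)
print(conv(points).shape)        # (32, 1024, 64)

# The very same layer also works for any other number of points:
more_points = tf.random.uniform((32, 2048, 3))
print(conv(more_points).shape)   # (32, 2048, 64)
```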
All the point features are reduced to one global feature vector. To achieve this, PointNet uses global max pooling as its symmetric function: for each of the D feature dimensions, it keeps only the largest value across all points, resulting in one D-dimensional vector that contains a signature of the input space. Have a look at the graphic below for a visualization of this approach.
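Expressed as a small sketch, this is simply a max reduction over the point axis:

```python
import tensorflow as tf

# Per-point features of shape (B x N x D), here with D = 1024
point_features = tf.random.uniform((32, 1024, 1024))

# Max pooling over the point axis (axis 1) is symmetric: any reordering
# of the N points yields exactly the same global feature vector.
global_feature = tf.reduce_max(point_features, axis=1)
print(global_feature.shape)  # (32, 1024)
```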
The third and maybe most confusing element of the PointNet architecture is the little guy in the image’s lower left, called T-Net. The appendix of the PointNet paper describes it as a mini-PointNet. It actually consists of the same building blocks as its big brother, but what does it do? Well, the process of semantic segmentation needs to be invariant to geometric transformations applied to the input. T-Net is used to achieve exactly this: it estimates an orthogonal transformation matrix, which is subsequently applied to the input coordinates. The result is a well-aligned representation of the input points. The same kind of transformation is applied to the 64-dimensional feature space by the second T-Net. However, since the dimensionality of the feature space is much higher than that of the input space, the authors included a regularization term to stabilize the optimization process and ensure that the estimated matrix stays close to an orthogonal rotation matrix. This is desirable, as orthogonal transformations do not lose information in the input.
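For reference, the regularization term from the paper is \(L_{reg} = \lVert I - AA^T \rVert_F^2 \), where \(A \) is the feature alignment matrix predicted by T-Net. It is added to the task loss with a small weight (0.001 in the paper), nudging \(A \) towards an orthogonal matrix.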
Before we move on and make the original implementation TensorFlow 2.x compatible, let us have a quick look at the original implementation of PointNet. The official GitHub repository divides the code into four different modules. The core model definitions reside in the `models` module, which provides each definition in a single Python file. Training and evaluation of the classification model can be done using the `train.py` and `evaluate.py` files respectively, which reside in the top-level directory. The semantic segmentation and part segmentation code, together with their training, data preparation and evaluation scripts, resides in the `sem_seg` and `part_seg` modules. The `utils` module contains some utility functions used in the model definitions and data preparation. We will alter this structure a little later on in order to reduce the amount of code duplication. But first, let’s have a look at the changes applied to the TensorFlow API.
TensorFlow 2.x
I’ve read many blog posts about the changes applied to TensorFlow with the official release of version 2.0 on September 30, 2019. At the time of this writing, the latest version of TensorFlow is 2.1, the last one to officially support Python 2.7. All the blog posts I have read did a really nice job of pointing out the major changes in the TensorFlow API, and I will link some of the posts that I enjoyed the most. However, I still want to recap the most important changes for you:
- (tf.)keras is the preferred way to define models. Although the Keras API has been integrated into TensorFlow since release 1.10, TF 2.0 marks the end of the Keras standalone era. The APIs are now fully in sync and the Keras repository will not receive any further updates. So make sure to change your import statements!
- No more globals! Finally, all the mess with implicit global namespaces in TensorFlow 1.x is gone! However, this means that you are now responsible for tracking your variables. If you somehow lose them, they are gone! Luckily, you can bypass this extra effort of variable tracking by simply using Keras objects as your main building blocks.
- Bye bye, Session API! My personal favorite is the introduction of eager execution as TensorFlow’s default computation mode. It always felt way too cumbersome to define the whole model ahead of time and debug it with print statements. TensorFlow 2.x introduces the `tf.function()` decorator to mark a function for JIT compilation, so that TensorFlow runs it as a single graph. The model definition process now feels much more Pythonic; see the small example after this list.
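Here is the small example mentioned above, contrasting eager execution with a `tf.function()`-compiled function:

```python
import tensorflow as tf

# Eager execution: operations run immediately, no session needed
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
print(tf.reduce_sum(x))  # tf.Tensor(10.0, shape=(), dtype=float32)

# tf.function traces the Python function into a single graph,
# which TensorFlow can then optimize and execute as a whole
@tf.function
def scaled_sum(t, factor):
    return tf.reduce_sum(t) * factor

print(scaled_sum(x, 2.0))  # tf.Tensor(20.0, shape=(), dtype=float32)
```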
So let’s convert PointNet to TensorFlow 2.x code. I will further refer to TensorFlow 2.x as TF2.
Automatically upgrade using the conversion script
TensorFlow 2 ships with a conversion script to automatically make your existing TensorFlow 1.x projects compatible with the API changes of TF2. I forked the original PointNet repository and moved it into the original folder of the attached GitHub repository. I am now able to upgrade the project with the following command:
```bash
tf_upgrade_v2 --intree original --outtree tf2 --reportfile tf2_upgrade_report.txt
```
The script traverses all Python files in the specified intree directory and string-replaces old TensorFlow 1.x API calls with their TF2 equivalents. Whenever it encounters problems during the conversion, it logs a hint about the problematic piece of code in the `tf2_upgrade_report.txt` file. So we will start by investigating that file.
Most of the lines in the log file contain information about renamed API calls or changes to parameter lists. For instance, the following output arises while processing the file containing the semantic segmentation model:
```
--------------------------------------------------------------------------------
Processing file 'original/sem_seg/model.py'
 outputting to 'tf2/sem_seg/model.py'
--------------------------------------------------------------------------------

13:21: INFO: Renamed 'tf.placeholder' to 'tf.compat.v1.placeholder'
15:16: INFO: Renamed 'tf.placeholder' to 'tf.compat.v1.placeholder'
64:11: INFO: Added keywords to args of function 'tf.reduce_mean'
68:12: INFO: Renamed 'tf.placeholder' to 'tf.compat.v1.placeholder'
70:13: INFO: Renamed 'tf.Session' to 'tf.compat.v1.Session'
71:19: INFO: Renamed 'tf.global_variables_initializer' to 'tf.compat.v1.global_variables_initializer'
--------------------------------------------------------------------------------
```
Some other changes need manual checks, for example all `*.save()` calls. However, this applies only if we try to save Keras models, because in TF2, Keras’ `model.save()` writes the model definition in the TensorFlow SavedModel format instead of HDF5. PointNet only uses model checkpointing, so we can skip this for now and move on to the next warning, which requires us to check all `tf.get_variable()` calls.
In TF2, `tf.get_variable()` returns a `tf.ResourceVariable` instead of a `tf.Variable`, which has well-defined semantics and is stricter about shapes. This is implemented using a `read_value` operation that is added to the graph. The tensors returned by this operation are guaranteed to see all modifications applied to the variable by operations on which the `read_value` depends. Conversely, the tensors are guaranteed not to see any modifications applied to the variable by operations that depend on the `read_value`. In other words, operations behave as if they were executed in some total order consistent with the partial ordering constraints enforced by the graph. The full semantics are described in the Concurrency Semantics For TensorFlow Resource Variables. The two snippets below show the different behavior of TensorFlow variables in versions 1.x and 2.x.
```python
import tensorflow as tf

with tf.compat.v1.Session() as sess:
    v1 = tf.Variable(0)
    v2 = tf.Variable(0)
    upd1 = tf.compat.v1.assign(v1, v2 + 1)
    upd2 = tf.compat.v1.assign(v2, v1 + 1)
    init = tf.compat.v1.global_variables_initializer()

    for i in range(10):
        sess.run(init)
        sess.run([upd1, upd2])
        print(*sess.run([v1, v2]))
```
```python
import tensorflow as tf

v1 = tf.Variable(0)
v2 = tf.Variable(0)

upd1 = tf.compat.v1.assign(v1, v2 + 1)
upd2 = tf.compat.v1.assign(v2, v1 + 1)

for i in range(10):
    print(v1.numpy(), v2.numpy())
```
The first snippet contains TensorFlow 1.x code, the second one equivalent TF2 code. However, they behave differently. The TF1 version could print either 1, 1 for v1 and v2 respectively, or 1, 2, or even 2, 1, depending on the order in which the two assignments happen to execute. The TF2 version, in contrast, always prints 1, 2.
In the PointNet implementation, `tf.get_variable()` calls are used to initialize the model parameters of the T-Net. We will refactor this code to Keras layers, so we don’t have to track all variables ourselves. Moreover, we do not have to care about those warnings in the report file at all, because we won’t do anything special with the variables, like controlling scopes.
Believe it or not, these are the only things that we really have to check for our PointNet implementation, and the code already runs in a TF2 environment. However, there are still some things that require a manual check, like all `tf.summary` calls, which are not automatically convertible by the script. We will deal with them later, when we refactor our training script.
Our current TF2 PointNet implementation is basically the original TF1 implementation with changed API calls. It doesn’t take advantage of any of the new features that make TF2 so appealing. In the following sections, I describe how we can achieve a “TF2-native” implementation. As a result, we will get concise model definitions and training/evaluation loops.
Make the code TensorFlow 2.0-native
Let’s start the conversion with the `utils` module. It contains several implementations of convolutional layers, in particular different kinds of convolutions with batch normalization, dropout, max and average pooling, and fully connected layers. Since we will use `tf.keras` to build our TF2 PointNet, we will remove most of the utility functions in this module and replace them with their Keras equivalents. We will only keep the definitions for convolutional and fully connected layers with batch normalization. The resulting code for the convolutional layer looks like this:
```python
def conv1d_bn(x, num_filters, kernel_size, padding='same', strides=1,
              use_bias=False, scope=None, activation='relu'):
    """ Utility function to apply Convolution + Batch Normalization. """
    with K.name_scope(scope):
        input_shape = x.get_shape().as_list()[-2:]
        x = Conv1D(num_filters, kernel_size, strides=strides, padding=padding,
                   use_bias=use_bias, input_shape=input_shape)(x)
        x = BatchNormalization(momentum=0.9)(x)
        x = Activation(activation)(x)
    return x
```
The code for the dense layers is updated analogously. The original implementation uses 2D convolutions, but in my opinion 1D convolutions are a better fit for the point cloud data structure, so we will use them in our implementation. Interestingly, a short look into the TensorFlow documentation reveals that even when we call a 1D convolution, TensorFlow reshapes the data and computes a 2D convolution internally. So our code is still equivalent to the original implementation, but more expressive.
We will not alter the other files in this module, since they do not contain any TensorFlow code. So let’s move on and refactor the network architectures (models module) to Keras models.
First, we will refactor the T-Net implementations. They differ only in the size of the transformation matrix and the additional regularization term in the feature transform. Therefore, we will wrap the T-Net in a function and make the size of the matrix and the regularization configurable. For the latter, we will define a Keras Regularizer subclass in the utility package.
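The class below is a minimal sketch of such an orthogonality regularizer, implementing the \(L_{reg} \) term from above; the class name and default values are chosen for illustration and may differ from the code in the repository:

```python
import tensorflow as tf

class OrthogonalRegularizer(tf.keras.regularizers.Regularizer):
    """Penalizes transform matrices that deviate from orthogonality,
    following the PointNet term L_reg = ||I - A A^T||_F^2."""

    def __init__(self, num_features=64, weight=0.001):
        self.num_features = num_features
        self.weight = weight

    def __call__(self, x):
        # x is the flattened transform matrix predicted by the T-Net
        a = tf.reshape(x, (-1, self.num_features, self.num_features))
        aat = tf.matmul(a, a, transpose_b=True)
        identity = tf.eye(self.num_features)
        return self.weight * tf.reduce_sum(tf.square(identity - aat))

    def get_config(self):
        return {'num_features': self.num_features, 'weight': self.weight}
```

Such a regularizer can be attached as an `activity_regularizer` to the layer that predicts the flattened transform matrix.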
Subsequently, we will separate out the base model up to the global feature vector as it is used in both the segmentation and classification task. We therefore define it as a function with configurable input dimension. It returns the output of the global max pooling operation and the output of the applied feature transform.
Second, we refactor the task-specific networks. We start with the classification network: we call the base model that we just extracted and add the fully connected classification part on top. We do the same for the semantic segmentation network. The part segmentation network is a little more complicated, because it has multiple inputs. Alongside the points, we also feed a one-hot encoded vector into the network, which indicates the object category. According to the PointNet creators, this is due to the lack of training data in the ShapeNet part data set. As a result, the task reduces to predicting object parts for a known object category instead of predicting object parts across all possible categories.
We use the Keras functional API to define our models. This enables us to define small building blocks of PointNet as functions that we can reuse in different parts of the models. Another possible way to define the PointNet architecture would be to subclass `tf.keras.Model`, as this provides a higher level of architectural flexibility. However, the model subclassing approach is considered harder to write and debug, and the code is more verbose, because you have to define the model architecture and the forward pass separately. Therefore, we use the functional API, as it provides enough flexibility to implement the PointNet architecture. The following snippet shows the implementation of the base model, which is shared across all task-specific network architectures.
```python
def get_model(inputs):
    """
    Convolutional portion of model, common across different tasks
    (classification, segmentation, etc.)

    :param inputs: Input tensor with the point cloud shape (BxNxK)
    :return: tensor layer for CONV5 activations, tensor layer with local features
    """
    # Obtain spatial point transform from inputs and convert inputs
    ptransform = transform_net(inputs, scope='transform_net1', regularize=False)
    point_cloud_transformed = Dot(axes=(2, 1))([inputs, ptransform])

    # First block of convolutions
    net = conv1d_bn(point_cloud_transformed, num_filters=64, kernel_size=1,
                    padding='valid', use_bias=True, scope='conv1')
    net = conv1d_bn(net, num_filters=64, kernel_size=1,
                    padding='valid', use_bias=True, scope='conv2')

    # Obtain feature transform and apply it to the network
    ftransform = transform_net(net, scope='transform_net2', regularize=True)
    net_transformed = Dot(axes=(2, 1))([net, ftransform])

    # Second block of convolutions
    net = conv1d_bn(net_transformed, num_filters=64, kernel_size=1,
                    padding='valid', use_bias=True, scope='conv3')
    net = conv1d_bn(net, num_filters=128, kernel_size=1,
                    padding='valid', use_bias=True, scope='conv4')
    hx = conv1d_bn(net, num_filters=1024, kernel_size=1,
                   padding='valid', use_bias=True, scope='hx')

    # Add max pooling here, because it is needed in both nets.
    net = GlobalMaxPooling1D(data_format='channels_last', name='maxpool')(hx)

    return net, net_transformed
```
Last but not least, we have to refactor the training scripts. Finally, we can get rid of the sessions. We will train and evaluate the models using Keras. Remember that we still need to check the `tf.summary` calls? We bypass this by using the TensorBoard callback and letting Keras do the logging for us, so we don’t need any `tf.summary` calls at all. We replace the training and evaluation functions of the original implementation with Keras `model.fit_generator` calls. Therefore, we need to write a data provider that feeds training and evaluation samples to our model. Due to the large size of ModelNet40, we create an HDF5 virtual dataset, so that we do not need to load all point clouds into memory. Our data generator is a subclass of Keras Sequence and contains all the augmentation and sampling logic; a stripped-down sketch follows after the training code below. If you aren’t familiar with Keras data providers, consider reading this introduction. The following snippet contains the full training pipeline:
```python
def train():
    model = MODEL.get_model((None, 3), NUM_CLASSES)

    learning_rate = get_learning_rate_schedule()
    optimizer = tf.keras.optimizers.Adam(learning_rate)

    PointCloudProvider.initialize_dataset()
    generator_training = PointCloudProvider('train', BATCH_SIZE,
                                            n_classes=NUM_CLASSES,
                                            sample_size=MAX_NUM_POINT)
    generator_validation = PointCloudProvider('test', BATCH_SIZE,
                                              n_classes=NUM_CLASSES,
                                              sample_size=MAX_NUM_POINT)

    callbacks = [
        tf.keras.callbacks.ModelCheckpoint(CKPT_DIR, save_weights_only=False,
                                           save_best_only=True),
        tf.keras.callbacks.TensorBoard(LOG_DIR)
    ]

    model.compile(optimizer=optimizer,
                  loss='categorical_crossentropy',
                  metrics=['categorical_accuracy'])

    model.fit_generator(generator=generator_training,
                        validation_data=generator_validation,
                        steps_per_epoch=len(generator_training),
                        validation_steps=len(generator_validation),
                        epochs=MAX_EPOCH,
                        callbacks=callbacks,
                        use_multiprocessing=False)

    model.save("trained_model.pb")
```
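As mentioned above, here is a stripped-down sketch of what such a Sequence-based data provider might look like. The real PointCloudProvider additionally handles the HDF5 virtual dataset, shuffling, and augmentation; the names and details below are chosen for illustration:

```python
import numpy as np
import tensorflow as tf

class PointCloudSequence(tf.keras.utils.Sequence):
    """Illustrative data provider yielding (points, one-hot labels) batches."""

    def __init__(self, clouds, labels, batch_size, n_classes, sample_size):
        self.clouds = clouds          # list of (N_i x 3) point cloud arrays
        self.labels = labels          # integer class labels
        self.batch_size = batch_size
        self.n_classes = n_classes
        self.sample_size = sample_size

    def __len__(self):
        # Number of batches per epoch
        return int(np.ceil(len(self.clouds) / self.batch_size))

    def __getitem__(self, idx):
        batch = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        # Sample a fixed number of points from each cloud so that all
        # clouds in the batch share the same shape
        sampled = [c[np.random.choice(len(c), self.sample_size)]
                   for c in self.clouds[batch]]
        x = np.stack(sampled).astype(np.float32)
        y = tf.keras.utils.to_categorical(self.labels[batch], self.n_classes)
        return x, y
```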
Conclusion
We have successfully converted the original PointNet implementation to TensorFlow 2.0. Well, because of the switch from a native TensorFlow 1.x implementation to the use of tf.keras, it felt more like a complete reimplementation, but I learned a lot about the internals of TensorFlow 2 during the conversion process. The final result can be found on GitHub. You can find both the original and the TensorFlow 2 implementation in the repository. I also included the full output of the conversion script. Feel free to dig through the code and use the implementation for your own 3D deep learning and TensorFlow 2 projects.