Vision

MyGym enables you to use pre-trained vision models to extend the versatility of your training scenarios. The vision models can be used instead of ground truth data from the simulator to retrieve information about the environment where the robot performs its task. They take the simulator's camera data (RGB and/or depth images) as inputs to inference and return information about the observed scene. As a result, your training becomes independent of ground truth from the simulator and can therefore be transferred more easily to real robot tasks.

MyGym integrates two vision modules - YOLACT and VAE - and you can switch between ground truth and these modules when specifying the source of the reward signal in the config file or as a command line argument: reward_type is either gt (ground truth), 3dvs (YOLACT), or 2dvu (VAE).
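
For example (a sketch only; the training script name, config path and exact flag syntax are assumptions based on the standard myGym training workflow):

# in the training config (JSON):  "reward_type": "3dvs"
# or as a command line override:
python train.py --config <path_to_train_config> --reward_type 3dvs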

YOLACT

MyGym implements YOLACT [1] for instance segmentation. If 3dvs is chosen as reward_type, the pre-trained YOLACT model is used to get observations from the environment. The input to YOLACT inference is the RGB image rendered by the active camera; the inference results are masks and bounding boxes of the detected objects. The vision module then calculates the positions of the centroids of the detected objects in pixel space. Lastly, it uses the depth image from the active camera to project each object's centroid into 3D world coordinates. This way, the absolute position of task objects is obtained purely from sensory data, without any ground truth inputs.
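
The projection step is conceptually simple. The sketch below is not myGym's exact implementation; it assumes the pinhole intrinsics and the camera-to-world transform of the active camera are available, and back-projects a pixel centroid and its depth value into world coordinates:

import numpy as np

def pixel_to_world(u, v, depth, fx, fy, cx, cy, cam_to_world):
    """Back-project pixel (u, v) with metric depth into 3D world coordinates.

    fx, fy, cx, cy -- pinhole intrinsics of the active camera (assumed known)
    cam_to_world   -- 4x4 homogeneous camera-to-world transform (assumed known)
    """
    # Pinhole back-projection into the camera frame
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    point_cam = np.array([x, y, depth, 1.0])
    # Transform the point into world coordinates
    return (cam_to_world @ point_cam)[:3]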

The current pre-trained model can detect all Objects and three of the Robots, including their grippers (kuka, jaco, panda).

If you would like to train a new YOLACT model, you can use the dataset generator provided in myGym, see Generate dataset. For instructions on the training itself, visit the YOLACT home page.

[1] Daniel Bolya, Chong Zhou, Fanyi Xiao, and Yong Jae Lee (2019). YOLACT: Real-time Instance Segmentation. In ICCV.

VAE

The objective of an unsupervised version of the prepared tasks (reach, push, pick and place, etc.) is to minimize the difference between the actual and goal scene images. To measure their difference, we have implemented a variational autoencoder (VAE) that compresses each image into an n-dimensional latent vector. Since the VAE is optimized to preserve similarities among images in the latent space (scenes with objects close to each other have encoded vectors that are also close to each other), it is possible to measure the Euclidean distance between the encoded scenes and use it for reward calculation - i.e., the smaller the Euclidean distance between the actual and goal image, the higher the reward. Please note that a limitation of the VAE is that it only works conveniently with 2D information - i.e., it is a very weak source of visual information in 3D tasks such as pick and place.
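
As a minimal sketch (assuming the actual and the goal image have already been encoded into latent vectors; myGym's exact reward shaping may differ), the reward can be derived from the latent distance like this:

import numpy as np

def vae_distance_reward(z_actual, z_goal):
    """The closer the encoded actual scene is to the encoded goal scene, the higher the reward."""
    distance = np.linalg.norm(np.asarray(z_actual) - np.asarray(z_goal))
    return -distance  # or e.g. 1.0 / (1.0 + distance) for a bounded positive reward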

We provide a pretrained VAE for some of the task scenarios, but we also include code for training your own VAE (including dataset generation), so that you can create custom experiments. To learn how to train your robot with the pretrained weights, see Train a robot - unsupervised vision.

How to train a custom VAE

You are free to train your own VAE with a custom set of objects, level of randomisation, background scene or type of robot. Here we describe how.

Generating a dataset

To generate a VAE dataset, run the following script:

python generate_dataset.py configs/dataset_vae.json

All dataset parameters can be adjusted in configs/dataset_vae.json. They are described in the comments there, so here we only highlight the most important ones (an illustrative excerpt follows the list):

  • output_folder: where to save the resulting dataset

  • imsize: the size of the resulting square images. We currently only support VAE architectures for an imsize of 128 or 64. The image cropping is done automatically and can be adjusted in the code.

  • num_episodes: corresponds to the overall number of images in the dataset (in case the make_shot_every_frame parameter is set to 1)

  • random_arm_movement: whether to move the robot randomly, otherwise it stays fixed in its default position

  • used_class_names_quantity: which objects you want to show in the scene and how often. The names correspond to the urdf object names in the envs/objects directory. The first number in each list corresponds to the frequency, i.e. 1 is the default frequency and values above 1 make the object appear more often than the others.

  • object_sampling_area: the area in which the selected objects will be sampled; the format is xxyyzz

  • num_objects_range: in each image taken, a random number of objects from this range will appear in the scene

  • object_colors: if you have the color randomizer enabled and want some objects to have a fixed color, you can set it here

  • active_camera: the viewpoint from which the scene will be captured. The value 1 marks the camera that will be used. We currently only support one camera viewpoint for dataset generation.
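
For orientation, the excerpt below mirrors the structure of configs/dataset_vae.json as a commented Python dict. All values, the object names and the exact format of the list-valued entries are hypothetical placeholders; always follow the comments in the shipped config file:

dataset_config = {
    "output_folder": "vae_dataset",     # hypothetical output location
    "imsize": 128,                      # 128 or 64 are supported
    "num_episodes": 10000,              # total number of images (with make_shot_every_frame = 1)
    "make_shot_every_frame": 1,
    "random_arm_movement": 0,           # keep the robot in its default pose
    "used_class_names_quantity": [[1, "cube_holes"], [2, "hammer"]],  # [frequency, urdf name]; format assumed
    "object_sampling_area": [-0.7, 0.7, 0.3, 0.9, 0.1, 0.1],          # xxyyzz sampling box; values assumed
    "num_objects_range": [1, 3],        # a random count from this range appears in each image
    "object_colors": [],                # optional fixed colors when the color randomizer is on
    "active_camera": [0, 0, 0, 1, 0],   # the entry set to 1 marks the active viewpoint; format assumed
}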

Training VAE

Once you have your dataset ready, you can continue with VAE training. This is handled with the following script:

python train_vae.py --config vae/config.ini --offscreen

The --offscreen parameter turns off all visualisation, so if you want to see the progress, do not use it. All other parameters can be set in the config.ini file as follows (an illustrative excerpt follows the list):

  • n_latents: the dimensionality of the latent vector z

  • batch_size: choose any integer

  • lr: the learning rate

  • beta: the value of the beta parameter used to induce disentanglement of the latent space, as proposed in this paper

  • img_size: size of the square images to train on. Currently the only supported sizes are 64 or 128

  • use_cuda: whether to run the training on a CUDA-capable GPU

  • n_epochs: the number of training epochs

  • viz_every_n_epochs: how often to save the image reconstruction to monitor the training progress

  • annealing_epochs: the number of epochs over which to gradually increase the impact of the KLD term in the ELBO loss. See this paper for more info

  • log_interval: how often to print out the log about the training progress in the console

  • test_data_percentage: the fraction of the training dataset that will be used as testing data

  • dataset_path: path to the dataset folder containing images
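
An illustrative excerpt of vae/config.ini might look as follows (the section header and all values are hypothetical placeholders; keep whatever structure your shipped config.ini uses):

# hypothetical excerpt -- values for illustration only
[vae]
n_latents = 16
batch_size = 64
lr = 0.0001
beta = 4
img_size = 128
use_cuda = True
n_epochs = 100
viz_every_n_epochs = 10
annealing_epochs = 20
log_interval = 100
test_data_percentage = 0.1
dataset_path = ./vae_dataset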

The trained VAE will be saved in the ciircgym/vae/trained_models/ folder, along with the config used for the training and visualisations.

class myGym.envs.vision_module.VisionModule(vision_src='ground_truth', env=None, vae_path=None, yolact_path=None, yolact_config=None)[source]

Vision class that retrieves information from the environment based on a visual subsystem (YOLACT, VAE) or ground truth

Parameters:
param vision_src

(string) Source of information from environment (ground_truth, yolact, vae)

param env

(object) Environment, where the training takes place

param vae_path

(string) Path to a trained VAE in 2dvu reward type

param yolact_path

(string) Path to a trained Yolact in 3dvs reward type

param yolact_config

(string) Path to a saved Yolact config object, or the name of an existing one in the data/Config script, or None for autodetection
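
A minimal construction sketch (the weight paths are placeholders and env stands for an already created myGym environment; inside myGym the module is normally instantiated by the environment itself):

from myGym.envs.vision_module import VisionModule

# env is an existing myGym environment instance; the paths below are placeholders
vision = VisionModule(vision_src="yolact",
                      env=env,
                      yolact_path="trained_models/weights.pth",
                      yolact_config="trained_models/config.obj")

print(vision.get_module_type())   # -> "yolact"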

get_module_type()[source]

Get source of the information from environment (ground_truth, yolact, vae)

Returns:
return source

(string) Source of information

crop_image(img)[source]

Crop image by 1/4 from each side

Parameters:
param img

(list) Original image

Returns:
return img

(list) Cropped image

get_obj_pixel_position(obj=None, img=None)[source]

Get mask and centroid in pixel space coordinates of an object from 2D image

Parameters:
param obj

(object) Object to find its mask and centroid

param img

(array) 2D input image to inference of vision model

Returns:
return mask

(list) Mask of object

return centroid

(list) Centroid of object in pixel space coordinates

get_obj_bbox(obj=None, img=None)[source]

Get bounding box of an object from 2D image

Parameters:
param obj

(object) Object to find its bounding box

param img

(array) 2D input image to inference of vision model

Returns:
return bbox

(list) Bounding box of object

get_obj_position(obj=None, img=None, depth=None)[source]

Get object position in world coordinates of environment from 2D and depth image

Parameters:
param obj

(object) Object to find its mask and centroid

param img

(array) 2D input image to inference of vision model

param depth

(array) Depth input image to inference of vision model

Returns:
return position

(list) Centroid of object in world coordinates
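
Continuing the sketch above (rgb and depth are assumed to be images rendered by the active camera, and task_object an object instance from the environment):

position = vision.get_obj_position(obj=task_object, img=rgb, depth=depth)
print(position)   # [x, y, z] centroid of the object in world coordinates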

get_obj_orientation(obj=None, img=None)[source]

Get object orientation in world coordinates of environment from 2D image

Parameters:
param obj

(object) Object to find its mask and centroid

param img

(array) 2D input image to inference of vision model

Returns:
return orientation

(list) Orientation of object in world coordinates

vae_generate_sample()[source]

Generate image as a sample of VAE latent representation

Returns:
return dec_img

Generated image from VAE latent representation

encode_with_vae(imgs, task='reach', decode=0)[source]

Encode the input image into an n-dimensional latent variable using VAE model

Parameters:
param imgs

(list of arrays) Input images

param task

(string) Type of learned task (reach, push, …)

param decode

(bool) Whether to decode encoded images from latent representation back to image array

Returns:
return latent_z

(list) Latent representation of images

return dec_img

(list of arrays) Decoded images from latent representation back to image arrays

inference_yolact(img)[source]

Run inference using the YOLACT model

Parameters:
param img

(array) Input 2D image

Returns:
return classes

(list of ints) Class IDs of detected objects

return class_names

(list of strings) Class names of detected objects

return scores

(list of floats) Scores (confidence) of object detections

return boxes

(list of lists) Bounding boxes of detected objects

return masks

(list of lists) Masks of detected objects

return centroids

(list of lists) Centroids of detected objects in pixel space coordinates
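
Continuing the same sketch, the raw YOLACT detections can be inspected directly (rgb is again an image rendered by the active camera):

classes, class_names, scores, boxes, masks, centroids = vision.inference_yolact(rgb)
for name, score, centroid in zip(class_names, scores, centroids):
    print(f"{name}: confidence {score:.2f}, pixel centroid {centroid}")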