Controllable Visual-Tactile Synthesis

ICCV 2023

Carnegie Mellon University

Content creation beyond visual output: We present Visual-Tactile-Synthesis, an image-to-image translation method that synthesizes the visual appearance and tactile geometry of different materials, given a handcrafted or DALL⋅E 2 sketch. The visual and tactile outputs can be rendered on a surface haptic device such as TanvasTouch®, where users can slide their fingers on the screen to feel the rendered textures. Turn the audio ON to hear the sound of the rendering. [Download the video here]

We show the colored mesh for each object. The mesh is exaggerated in the z direction to show fine textures.

Abstract

Deep generative models have various content creation applications, such as graphic design, e-commerce, and virtual try-on. However, current works mainly focus on synthesizing realistic visual outputs, often ignoring other sensory modalities, such as touch, which limits physical interaction with users.

In this work, we leverage deep generative models to create a multi-sensory experience where users can touch and see the synthesized object when sliding their fingers on a haptic surface. The main challenges lie in the significant scale discrepancy between vision and touch sensing and the lack of an explicit mapping from touch sensing data to a haptic rendering device.

To bridge this gap, we collect high-resolution tactile data with a GelSight sensor and create a new visuotactile clothing dataset. We then develop a conditional generative model that synthesizes both visual and tactile outputs from a single sketch. We evaluate our method in terms of image quality and tactile rendering accuracy. Finally, we introduce a pipeline to render high-quality visual and tactile outputs on an electroadhesion-based haptic device for an immersive experience, allowing for challenging materials and editable sketch inputs.

Results: Visual-Tactile Synthesis with interactive patches

We show one example of FlowerJeans with clickable patches. Choose an image format from the "file" dropdown list, then click on the image to see the corresponding patches. We show patches of the sketch, visual image, tactile gx, tactile gy, and the normal map. The top row shows the ground truth, and the bottom row shows our synthesized results.


Results: swapping different materials

Here we show the generated visual and tactile outputs. The leftmost column shows the testing sketch input, and the remaining columns show the generated visual and tactile outputs. The tactile outputs are shown as surface normal maps to represent the geometry information. Our model generates high-quality visual and tactile outputs for various clothing categories, in terms of both global shape and local textures.


Results: text-guided visual-tactile synthesis

We also extend our method to synthesize visual-tactile outputs given both sketches and text prompts. We first use DALL⋅E 2 to create variations of an original sketch in the TouchClothing dataset and then feed the edited sketches to our conditional generative model. Sample results are shown below, with paired visual images and tactile normal maps; a rough code sketch of this pipeline follows the prompt list. The top row shows the training objects, and the text prompt is listed below each testing sketch. Despite the unseen shapes of pockets and embroidery, our model generalizes well, matching visual and tactile features to the shape information embedded in the sketch.

Text prompts for the testing sketches:

- a black-white sketch of a hoodie, single-pixel-width stroke
- a black-white sketch of a shirt with daisy-shaped patterns
- a sketch of a pair of jeans
- a sketch of a pair of pants with two pockets, where the pockets are in the shape of a cat head
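As a rough sketch of the first half of this pipeline, the snippet below requests sketch variations and binarizes them into generator-ready tensors. It assumes the legacy (<1.0) openai-python image-variation call; the exact DALL⋅E 2 interface and preprocessing used in the paper are not reproduced here, and the thresholding is a placeholder.

```python
import io

import numpy as np
import openai            # legacy (<1.0) openai-python interface -- an assumption
import requests
import torch
from PIL import Image


def sketch_variations_to_tensors(sketch_path: str, n: int = 4) -> list:
    """Request DALL-E 2 variations of a sketch and binarize them for the generator.

    Returns a list of (1, 1, H, W) float tensors with strokes = 1 and background = 0.
    """
    resp = openai.Image.create_variation(image=open(sketch_path, "rb"),
                                         n=n, size="1024x1024")
    tensors = []
    for item in resp["data"]:
        # Download each variation and convert it back to a binary sketch.
        img = Image.open(io.BytesIO(requests.get(item["url"]).content)).convert("L")
        binary = (np.array(img) < 128).astype(np.float32)   # dark strokes -> 1
        tensors.append(torch.from_numpy(binary)[None, None])
    return tensors
```

Each returned tensor can then be concatenated with the positional encoding and object mask described in the model overview below and passed through the trained generator.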

Data Acquisition and TouchClothing Dataset

Sensor Setup: We obtain synchronized visual-tactile data by registration with ArUco marker tracking. We use a GelSight R1.5 sensor to capture the tactile local geometry. The sensor outputs gx and gy, which represent the surface gradients in the x and y directions, respectively. We use a PiCamera mounted on the top for visual capture and ArUco marker detection. The video shows a detailed demo.
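For reference, the normal maps shown in the results can be derived from such gradient maps by treating (gx, gy) as partial derivatives of the height field. Below is a minimal NumPy sketch; the sign and scaling conventions are our assumptions, not necessarily those of the sensor driver.

```python
import numpy as np

def gradients_to_normal_map(gx: np.ndarray, gy: np.ndarray) -> np.ndarray:
    """Convert surface gradient maps (gx, gy) into an RGB normal map for visualization.

    Assumes gx = dz/dx and gy = dz/dy of the height field z(x, y), so the
    un-normalized surface normal is (-gx, -gy, 1).
    """
    normal = np.stack([-gx, -gy, np.ones_like(gx)], axis=-1).astype(np.float64)
    normal /= np.linalg.norm(normal, axis=-1, keepdims=True)
    return ((normal + 1.0) * 0.5 * 255).astype(np.uint8)  # map [-1, 1] to [0, 255]
```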

Data Format: For each object, we have one full sketch, one full visual image, and about 200 tactile patches, covering about 1/6 of the total area.
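As a rough illustration of this layout (field names and shapes are ours, not the dataset's actual schema), one object's record might look like the following.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

import numpy as np


@dataclass
class TactilePatch:
    gx: np.ndarray               # local surface gradient in x, shape (H_p, W_p)
    gy: np.ndarray               # local surface gradient in y, shape (H_p, W_p)
    center_xy: Tuple[int, int]   # patch location in full-image pixel coordinates


@dataclass
class ObjectRecord:
    sketch: np.ndarray           # full-garment sketch, shape (H, W)
    visual: np.ndarray           # full-garment photo, shape (H, W, 3)
    patches: List[TactilePatch] = field(default_factory=list)  # ~200 per object
```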

Dataset: Our dataset contains 20 garments in total, covering various fabrics commonly seen in the market, such as denim, corduroy, linen, fleece, and knitted materials. We also cover diverse object colors and shapes, such as shirts, vests, sweaters, pants, shorts, and skirts.

Controllable Visual-Tactile Synthesis

Model overview: Given a sketch input, we concatenate it with the positional encoding of the pixel coordinates and the object mask as the network input. We use U-Net as the generator's backbone but split the decoder into two branches from an intermediate layer for the visual and tactile outputs. The main challenge of visual-tactile synthesis lies in the large discrepancy between the receptive fields of vision and touch: we have global ground truth for vision but only local patches for touch. To tackle this challenge, we propose dense supervision for visual data and sparse supervision for tactile data. We generate the tactile output at full scale and crop patches for the discriminator. We condition the visual discriminator on the sketch and the tactile discriminator on both the sketch and the visual output. This way, we allow vision and touch to share the same encoding, leverage vision's global information, and take full advantage of local tactile details.
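To make the dense-vs-sparse supervision concrete, here is a heavily simplified PyTorch sketch of a shared-encoder generator whose decoder splits into visual and tactile heads, plus the patch cropping applied to the full-scale tactile output before the tactile discriminator. Layer sizes, the split point, and all names are our assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.InstanceNorm2d(c_out), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.InstanceNorm2d(c_out), nn.ReLU(inplace=True),
    )


class TwoBranchUNet(nn.Module):
    """Shared encoder + shared decoder stage, then separate visual / tactile heads."""

    def __init__(self, in_ch=1 + 2 + 1, base=32):
        # in_ch: sketch (1) + positional encoding of pixel coordinates (2) + object mask (1)
        super().__init__()
        self.enc1 = conv_block(in_ch, base)
        self.enc2 = conv_block(base, base * 2)
        self.bottleneck = conv_block(base * 2, base * 4)
        self.shared_dec = conv_block(base * 4 + base * 2, base * 2)  # shared decoder stage with skip
        # Branch-specific heads: the decoder splits at an intermediate layer.
        self.visual_head = nn.Sequential(conv_block(base * 2 + base, base), nn.Conv2d(base, 3, 1))
        self.tactile_head = nn.Sequential(conv_block(base * 2 + base, base), nn.Conv2d(base, 2, 1))  # gx, gy

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(F.avg_pool2d(e1, 2))
        b = self.bottleneck(F.avg_pool2d(e2, 2))
        d = self.shared_dec(torch.cat([F.interpolate(b, scale_factor=2), e2], dim=1))
        d = torch.cat([F.interpolate(d, scale_factor=2), e1], dim=1)
        return self.visual_head(d), self.tactile_head(d)  # full-scale RGB, full-scale (gx, gy)


def crop_patches(tactile_full, centers, size=32):
    """Crop local patches from the full-scale tactile prediction so the tactile
    discriminator only sees regions where ground-truth GelSight data exists.
    (Boundary handling omitted for brevity.)"""
    crops = []
    for b, (cx, cy) in enumerate(centers):
        crops.append(tactile_full[b : b + 1, :,
                                  cy - size // 2 : cy + size // 2,
                                  cx - size // 2 : cx + size // 2])
    return torch.cat(crops, dim=0)
```

In this scheme, the visual output receives a full-image conditional GAN loss, while the tactile output is only penalized on patches whose locations have ground-truth GelSight data, matching the dense/sparse supervision described above.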

Tactile-to-Haptic rendering: After obtaining the tactile outputs gx and gy, which represent the surface gradients, we convert them into the grayscale friction map required by the TanvasTouch rendering device. We first compute the squared magnitude of the surface gradient, then apply a non-linear mapping function for contrast enhancement, and finally resize it to the TanvasTouch screen size as the final friction map. We empirically find this helps enhance the feeling of textures rendered with electroadhesive force.
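A minimal sketch of this conversion with NumPy and OpenCV is shown below; the gamma curve and screen resolution are illustrative placeholders rather than the exact mapping and dimensions used with TanvasTouch.

```python
import cv2
import numpy as np


def gradients_to_friction_map(gx, gy, screen_hw=(720, 1280), gamma=0.5):
    """Convert surface gradients (gx, gy) into a grayscale friction map.

    1. Squared gradient magnitude as a proxy for local texture strength.
    2. Non-linear (gamma) mapping for contrast enhancement -- the paper's exact
       mapping function is not reproduced here.
    3. Resize to the haptic screen resolution and quantize to 8 bits.
    """
    mag = gx ** 2 + gy ** 2
    mag = mag / (mag.max() + 1e-8)            # normalize to [0, 1]
    mag = np.power(mag, gamma)                # contrast enhancement
    friction = cv2.resize(mag, (screen_hw[1], screen_hw[0]), interpolation=cv2.INTER_LINEAR)
    return (friction * 255).astype(np.uint8)
```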

Acknowledgment

We thank Sheng-Yu Wang, Kangle Deng, Muyang Li, Aniruddha Mahapatra, and Daohan Lu for proofreading the draft. We are also grateful to Sheng-Yu Wang, Nupur Kumari, Gaurav Parmar, George Cazenavette, and Arpit Agrawal for their helpful comments and discussion. Additionally, we thank Yichen Li, Xiaofeng Guo, and Fujun Ruan for their help with the hardware setup. Ruihan Gao is supported by A*STAR National Science Scholarship (Ph.D.).

BibTeX

@inproceedings{gao2023controllable,
  title     = {Controllable Visual-Tactile Synthesis},
  author    = {Gao, Ruihan and Yuan, Wenzhen and Zhu, Jun-Yan},
  booktitle = {IEEE International Conference on Computer Vision (ICCV)},
  year      = {2023},
}