Deep generative models have various content creation applications such as graphic design, e-commerce, and virtual try-on. However, current works mainly focus on synthesizing realistic visual outputs, often ignoring other sensory modalities, such as touch, which limits physical interaction with users.
In this work, we leverage deep generative models to create a multi-sensory experience where users can touch and see the synthesized object when sliding their fingers on a haptic surface. The main challenges lie in the significant scale discrepancy between vision and touch sensing and the lack of explicit mapping from touch sensing data to a haptic rendering device.
To bridge this gap, we collect high-resolution tactile data with a GelSight sensor and create a new visuotactile clothing dataset. We then develop a conditional generative model that synthesizes both visual and tactile outputs from a single sketch. We evaluate our method regarding image quality and tactile rendering accuracy. Finally, we introduce a pipeline to render high-quality visual and tactile outputs on an electroadhesion-based haptic device for an immersive experience, allowing for challenging materials and editable sketch inputs.
We show one example, FlowerJeans, with patches of the sketch, visual image, tactile gx, tactile gy, and the normal map. The top row shows the ground truth and the bottom row shows our synthesized results.
Here we show the generated visual and tactile outputs. The leftmost column shows the testing sketch input, and the remaining columns show the generated visual and tactile outputs. The tactile outputs are shown as surface normal maps to represent the geometry information. Our model generates high-quality visual and tactile outputs for various clothing categories, in terms of both global shape and local textures.
We also extend our method to synthesize visual-tactile outputs given both sketches and text prompts. We first use DALL⋅E 2 to create variations of an original sketch in the TouchClothing dataset and then feed the edited sketches to our conditional generative model. Sample results are shown below, with paired visual images and tactile normal maps. The top row shows the training object, and the text prompt is listed below each testing sketch. Despite the unseen shapes of pockets and embroideries, our model generalizes well, matching visual and tactile features to the shape information embedded in the sketch.
a black-white sketch of a hoodie, single-pixel width stroke
a black-white sketch of a shirt with daisy-shape patterns.
a sketch of a pair of jeans
a sketch of a pair of pants with two pockets, pockets are in the shape of a cat head
Sensor Setup: We obtain synchronized visual-tactile data by registration with ArUco marker tracking. We use a GelSight R1.5 sensor to capture local tactile geometry. The sensor outputs, gx and gy, represent the surface gradients in the x and y directions, respectively. We use a PiCamera mounted on top for visual capture and ArUco marker detection. The video shows a detailed demo.
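Throughout the page we visualize the tactile outputs as surface normal maps. Below is a minimal sketch of this conversion, assuming gx and gy are the surface height gradients; the function name is illustrative.

```python
import numpy as np

def gradients_to_normal_map(gx, gy):
    """Convert GelSight surface gradients (gx, gy) into an RGB normal map.

    Assumes gx and gy are the surface height gradients, so the unnormalized
    surface normal at each pixel is proportional to (-gx, -gy, 1).
    """
    normal = np.stack([-gx, -gy, np.ones_like(gx)], axis=-1)
    normal /= np.linalg.norm(normal, axis=-1, keepdims=True)
    # Map components from [-1, 1] to [0, 1] for visualization as an RGB image.
    return (normal + 1.0) / 2.0
```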
Data Format: For each object, we have one full sketch, one full visual image, and about 200 tactile patches, covering about 1/6 of the total area.
Dataset: Our dataset contains 20 garments in total, covering various fabrics commonly seen in the market, such as denim, corduroy, linen, fleece, and knitted material. We also cover diverse object colors and shapes, including shirts, vests, sweaters, pants, shorts, and skirts.
Model overview: Given a sketch input, we concatenate it with the positional encoding of the pixel coordinates and the object mask to form the network input. We use a U-Net as the generator's backbone but split the decoder into two branches at an intermediate layer for the visual and tactile outputs.
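Below is a minimal sketch (in PyTorch) of how this generator input could be assembled, assuming a standard sinusoidal positional encoding; the function name, number of frequency bands, and tensor layout are illustrative rather than the exact configuration used in our code.

```python
import math
import torch

def build_generator_input(sketch, mask, num_freqs=6):
    """Concatenate the sketch with a sinusoidal positional encoding of the
    pixel coordinates and the object mask to form the generator input.

    sketch: (B, 1, H, W) sketch image; mask: (B, 1, H, W) object mask.
    """
    B, _, H, W = sketch.shape
    ys = torch.linspace(-1, 1, H, device=sketch.device)
    xs = torch.linspace(-1, 1, W, device=sketch.device)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    coords = torch.stack([grid_x, grid_y], dim=0).unsqueeze(0).expand(B, -1, -1, -1)

    # Sinusoidal encoding of the normalized (x, y) coordinates.
    enc = [coords]
    for i in range(num_freqs):
        enc.append(torch.sin((2 ** i) * math.pi * coords))
        enc.append(torch.cos((2 ** i) * math.pi * coords))
    pos_enc = torch.cat(enc, dim=1)

    # Channel-wise concatenation: sketch + positional encoding + mask.
    return torch.cat([sketch, pos_enc, mask], dim=1)
```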
The main challenge of visual-tactile synthesis lies in the large discrepancy between the receptive fields of vision and touch: we have global ground truth for vision but only local patches for touch.
To tackle this challenge, we use dense supervision for the visual data and sparse supervision for the tactile data. We generate the tactile output at full scale and crop patches for the discriminator.
We condition the visual discriminator on the sketch and the tactile discriminator on both the sketch and the visual output.
This way, vision and touch share the same encoding, the model leverages vision's global information, and it takes full advantage of the local tactile details.
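Below is a minimal sketch of this sparse tactile supervision, assuming the GelSight contact locations are recorded as patch centers; the helper crop_patches, the patch size, and the discriminator input layout are hypothetical.

```python
import torch

def crop_patches(full_map, centers, patch_size=64):
    """Crop square patches from a full-resolution map at the given centers.

    full_map: (B, C, H, W) tensor, e.g. the synthesized full-scale gx/gy map.
    centers: list of (row, col) pixel coordinates of the recorded GelSight
    contacts, assumed to lie at least patch_size // 2 pixels from the border.
    """
    half = patch_size // 2
    crops = [full_map[:, :, r - half:r + half, c - half:c + half]
             for r, c in centers]
    return torch.cat(crops, dim=0)

# Illustrative use: the tactile discriminator only sees local crops,
# conditioned on the sketch and visual crops taken at the same locations.
# d_input = torch.cat([crop_patches(sketch, centers),
#                      crop_patches(fake_visual, centers),
#                      crop_patches(fake_tactile, centers)], dim=1)
# pred_fake = tactile_discriminator(d_input)
```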
Tactile-to-Haptic rendering: After obtaining the tactile outputs gx and gy, which represent the surface gradients, we convert them into the grayscale friction map required by the TanvasTouch rendering device.
We first compute the squared magnitude of the gradient, then apply a non-linear mapping function for contrast enhancement, and finally resize it to the TanvasTouch screen size to obtain the final friction map.
We empirically find this helps enhance the feeling of textures rendered with electroadhesive force.
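Below is a minimal sketch of this conversion; the gamma curve and the screen resolution are placeholder choices rather than the exact mapping we use.

```python
import cv2
import numpy as np

def tactile_to_friction_map(gx, gy, screen_wh=(1280, 800), gamma=0.5):
    """Convert synthesized surface gradients into a grayscale friction map.

    Steps: squared gradient magnitude -> non-linear contrast enhancement
    (a gamma curve here, as a placeholder) -> resize to the haptic screen.
    """
    mag = gx ** 2 + gy ** 2                        # squared gradient magnitude
    mag = mag / (mag.max() + 1e-8)                 # normalize to [0, 1]
    enhanced = np.power(mag, gamma)                # boost low-contrast texture
    friction = cv2.resize(enhanced, screen_wh,     # (width, height) for OpenCV
                          interpolation=cv2.INTER_LINEAR)
    return (friction * 255).astype(np.uint8)       # 8-bit grayscale friction map
```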
We thank Sheng-Yu Wang, Kangle Deng, Muyang Li, Aniruddha Mahapatra, and Daohan Lu for proofreading the draft. We are also grateful to Sheng-Yu Wang, Nupur Kumari, Gaurav Parmar, George Cazenavette, and Arpit Agrawal for their helpful comments and discussion. Additionally, we thank Yichen Li, Xiaofeng Guo, and Fujun Ruan for their help with the hardware setup. Ruihan Gao is supported by A*STAR National Science Scholarship (Ph.D.).
@inproceedings{gao2023controllable,
  title     = {Controllable Visual-Tactile Synthesis},
  author    = {Gao, Ruihan and Yuan, Wenzhen and Zhu, Jun-Yan},
  booktitle = {IEEE International Conference on Computer Vision (ICCV)},
  year      = {2023},
}