Development and evaluation of wireless 3D video conference system using decision tree and behavior network - EURASIP Journal on Wireless Communications and Networking

15 Jul.,2022

Video conferencing is a communication technology that allows multiple users to communicate with each other through both images and sound. As the performance of wireless networks has improved, data can be transmitted in real time to mobile devices over a wireless network. However, the amount of data that can be transmitted is limited, so it is essential to devise a method to reduce data traffic. There are two general methods to reduce data rates: extracting the user's image shape, and using virtual humans in video conferencing. However, data rates in a wireless network remain high even if only the user's image shape is transferred. With the latter method, the virtual human may express a user's movement erroneously because of insufficient information about body language or gestures. Hence, to conduct a video conference on a wireless network, a method to compensate for such erroneous actions is required. In this article, a virtual human-based video conference framework is proposed. To reduce data traffic, only the user's pose data are extracted from photographed images using an improved binary decision tree, after which they are transmitted to other users in a markup language. Moreover, a virtual human executes behaviors that express a user's movement accurately by means of an improved behavior network driven by the transmitted pose data. In an experiment, the proposed method was implemented on a mobile device. A 3-min video conference between two users was then analyzed, and the video conferencing process was described. Because photographed images were converted into a text-based markup language, the amount of transmitted data could be reduced effectively. By using the improved decision tree, the user's pose could be estimated with an average of 5.1 comparisons among 63 photographed images, carried out four times a second. The improved behavior network enables the virtual human to execute diverse behaviors.


Wireless Conference System

To express a user's movements through a virtual human, it is necessary to devise a method to extract and transmit the user's features and then reconstruct the virtual conference with virtual humans. This section describes a method to estimate a user's pose and to control a virtual human with the estimated pose.

3.1. Overview

The proposed virtual human-based video conference framework consists of a definition stage that predefines the data for video conferencing, a recognition stage that extracts pose data from the images, and a reconstruction stage that reconstructs the virtual conference (Figure 1). The definition stage is only performed once when a video conference is started, whereas the recognition and reconstruction stages are performed repeatedly during a video conference.

Figure 1

Virtual conference process.


The data are determined in the definition stage as follows. First, the images needed to estimate a user's pose must be defined. This requires generating images of the user's expected poses as photographed by a camera. The generated images are then compared with real-time photographed images of the user to estimate the user's pose. However, pose estimation is time-consuming if the number of expected poses is large. The pose-estimation time is therefore reduced by constructing a binary decision tree (referred to as the pose decision tree), so that each estimate is compared against only a subset of the expected pose images.

Next, a behavior network is defined to generate the behaviors executed by a virtual human. An action is defined by a virtual human's joint angles, and a behavior is expressed as a sequence of actions. Selected behaviors are executed by the virtual human. This behavior network is referred to as the consecutive action network. The pose images, pose decision tree, motions, actions, and consecutive action network shown in Figure 1 are all defined in the definition stage.

In the recognition stage, images are created by photographing users at certain intervals. A user's poses are estimated by comparing the photographed images with the silhouettes in the pose decision tree. The estimated poses are then transmitted to other users through the network. In the reconstruction stage, a user's presence is expressed through a virtual human according to the estimated pose.

3.2. Framework structure

In this section, we propose a framework that expresses a user's presence through a virtual human in a video conference. The framework that handles video conferences is structured as shown in Figure 2.

Figure 2

Framework structure.


The recognition stage converts the estimated pose into markup language and transmits it to the network as follows. The photographed images are received by the image receiver and sent to the background learner and the silhouette extractor. The background learner acquires background images while the user is absent and then transfers them to the silhouette extractor. Subsequently, the silhouette extractor extracts the shapes of users from the received images by taking the background images into account and transmits them to the pose estimator. The pose estimator searches the pose decision tree and estimates the poses of the received images. The estimated poses are then transmitted to the network through the message generator and the message sender (the former creates messages and the latter transmits them to other users). Each message contains a user's pose and speech.

The reconstruction stage then creates the image and voice from the received messages as follows. The message receiver transmits the pose and speech to the behavior planner and speech generator, respectively. The behavior planner plans the behaviors to be executed by the virtual human. The virtual human controller then executes the planned behaviors.
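The message generator and message receiver can be illustrated with a minimal sketch. The source states only that each message carries a user's pose and speech in a text-based markup language; the XML element names, the `user` attribute, and the representation of speech as text are assumptions made for illustration, as are the function names `build_message` and `parse_message`.

```python
import xml.etree.ElementTree as ET
from typing import Tuple


def build_message(user_id: str, pose_index: int, speech: str) -> str:
    """Serialize one recognition result as a small XML message.

    Element and attribute names are hypothetical; the paper does not
    specify its markup schema.
    """
    root = ET.Element("message", attrib={"user": user_id})
    ET.SubElement(root, "pose", attrib={"index": str(pose_index)})
    ET.SubElement(root, "speech").text = speech
    return ET.tostring(root, encoding="unicode")


def parse_message(xml_text: str) -> Tuple[str, int, str]:
    """Recover (user id, pose index, speech) on the receiving side."""
    root = ET.fromstring(xml_text)
    pose_index = int(root.find("pose").get("index"))
    speech = root.find("speech").text or ""
    return root.get("user"), pose_index, speech
```

A message built this way is a few dozen bytes of text, which is the point of sending a pose index instead of image data.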

3.2.1. Image receiver and silhouette extractor

In the recognition stage, the images are created by photographing users at certain intervals. The image receiver receives the photographed user-images and transmits them to the silhouette extractor. The $h$th user-image is defined as $i_h$, as shown in Equation (1). The set of user-images is defined as Set $I$:

$$i_h \in I, \quad I = \{ i_1, i_2, \ldots \}$$


Here, the image interval is denoted as $\varepsilon_{\text{Interval}}$. To estimate the poses precisely, the user-images are converted into silhouettes by a silhouette extraction process [10], as shown in Figure 3.

Figure 3

Silhouette extraction process.


A silhouette is an image that contains only a user's shape, without the background. The background images are recorded by the background learner in the definition stage and then transferred to the silhouette extractor to remove the background from the user-images. The user-silhouette is then extracted from the difference between the recorded background image and the user-image. The $h$th user-silhouette, extracted from the $h$th user-image, is defined as $s_h$, as shown in Equation (2). The set of user-silhouettes is defined as Set $S$:

$$s_h \in S, \quad S = \{ s_1, s_2, \ldots \}$$
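The background-difference step can be sketched as follows. This is a minimal illustration, not the extraction process of [10]: it assumes grayscale images stored as nested lists of integers in the range 0 to 255, and the per-pixel `threshold` value is a hypothetical parameter that the source does not specify.

```python
from typing import List

Image = List[List[int]]  # grayscale pixels, 0..255 (assumed representation)


def extract_silhouette(background: Image, frame: Image, threshold: int = 30) -> Image:
    """Return a binary mask that is 1 where the frame differs from the
    recorded background by more than `threshold`, and 0 elsewhere."""
    return [
        [1 if abs(f - b) > threshold else 0 for f, b in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]
```

Pixels that match the learned background fall to 0, so only the user's shape remains in the mask.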


3.2.2. Pose decision tree and pose estimator

The pose estimator, which estimates poses with the extracted silhouette in the recognition stage, must recognize multiple poses in real time in a mobile environment. However, the time to estimate poses increases with the number of poses because the number of comparisons also increases. To solve this problem, we propose a pose decision tree.

In the definition stage, the expected pose images of users are predefined so that the pose decision tree can be constructed in advance. First, the set of all expected poses is defined as Set $P$:

$$p_i \in P, \quad P = \{ p_1, p_2, \ldots \}$$


The set of expected pose images, Set $E$, is defined as shown in Equation (3) to estimate the pose of the extracted silhouette. $e_i$ is the image that is used to estimate pose $p_i$.

$$e_i \in E, \quad E = \{ e_1, e_2, \ldots \}$$


The expected pose images are also converted into expected silhouettes. The set of expected silhouettes is defined as Set $R$, where $r_i$ is the $i$th expected silhouette.

$$r_i \in R, \quad R = \{ r_1, r_2, \ldots \}$$


The pose decision tree consists of nodes, each of which contains one expected silhouette. The $i$th node $n_i$ is defined as shown in Equation (5):

$$n_i = \langle r_i, n^{\text{Left}}_i, n^{\text{Right}}_i, m_i, v_i \rangle$$


where $n^{\text{Left}}_i$ and $n^{\text{Right}}_i$ are the left and right child nodes of node $n_i$, respectively, and $m_i$ and $v_i$ are the matching value and center value of node $n_i$ (range 0 to 1), respectively. The matching value indicates how similar two silhouettes must be to be considered identical. For example, if its value is 1, the silhouettes are considered identical only when they are exactly the same image. In contrast, if its value is 0, the silhouettes are considered identical regardless of their differences. The matching value is determined empirically by trying various values during pose estimation. The center value, which is the criterion for choosing between the left and right child nodes during a search, is established automatically when the pose decision tree is constructed. As shown in Equation (6), there is a one-to-one relation between expected silhouette $r_i$ and node $n_i$.

The decision tree is constructed as follows. First, nodes are created for all expected silhouettes included in Set $R$. Second, node $n_1$ is defined as the root node. The nodes of the remaining silhouettes in Set $R$ are then registered as the child nodes of $n_1$ (Figure 4).

Figure 4

Selection of root node.


Third, the child nodes of $n_1$ are sorted by comparing the expected silhouette $r_1$ of the root node with the expected silhouette of each child node (Figure 5).

Figure 5

Sorting of child node.


As shown in Equation (7), the result of the comparison is the correlation coefficient of the two expected silhouettes, expressed as a normalized value that ranges from 0 to 1.

$$R(r_1, r_2) = \frac{\sum_{x_1, x_2} \left( T'(x_1, x_2) \cdot I'(r_1 + x_1, r_2 + x_2) \right)}{\sqrt{\sum_{x_1, x_2} T'(x_1, x_2)^2 \cdot \sum_{x_1, x_2} I'(r_1 + x_1, r_2 + x_2)^2}}$$
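As a rough illustration of this comparison, the sketch below computes a normalized correlation coefficient for two silhouettes of equal size at a single (zero) offset, using mean-subtracted pixel values. The source says only that the result is normalized to the range 0 to 1; clamping negative coefficients to 0 is an assumption made here, and the function names are hypothetical.

```python
from math import sqrt
from typing import List

Image = List[List[int]]  # equal-size silhouettes as nested lists (assumed)


def correlation(a: Image, b: Image) -> float:
    """Correlation coefficient of two equal-size images, in [-1, 1]."""
    xs = [v for row in a for v in row]
    ys = [v for row in b for v in row]
    if len(xs) != len(ys) or not xs:
        raise ValueError("images must be non-empty and of equal size")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys))
    if den == 0:
        # At least one image is constant; treat identical images as a full match.
        return 1.0 if xs == ys else 0.0
    return num / den


def similarity(a: Image, b: Image) -> float:
    """Map the coefficient into [0, 1] by clamping negatives (an assumption)."""
    return max(0.0, correlation(a, b))
```

Identical silhouettes score 1, and silhouettes that are exact inverses of each other score 0 under this clamping choice.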


Fourth, when there are $o$ children, the $\frac{o+1}{4}$th and $\frac{3(o+1)}{4}$th nodes are defined as the left and right nodes, respectively. The nodes whose index in the sorting sequence is equal to or smaller than $\frac{o+1}{2}$ move to the left node, whereas the nodes whose index is greater than $\frac{o+1}{2}$ move to the right node (Figure 6).

Figure 6

Sorting of left and right child nodes.


Fifth, the center value of the root node is set to the mean of the correlation coefficients of the last node on the left and the first node on the right. Lastly, the left and right nodes each sort their own child nodes through the same comparisons as the root node, and the procedure is repeated.
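The construction steps above can be sketched recursively. This is an illustration under stated assumptions, not the authors' implementation: silhouettes are referred to by integer index, `sim(i, j)` is any caller-supplied similarity in the range 0 to 1, children are sorted in ascending order of similarity to their parent, fractional positions are rounded down (with a minimum of 1), and a node without a right group keeps a center value of 1.0. The names `Node` and `build_pose_tree` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    index: int                      # index i of the expected silhouette r_i
    center: float = 1.0             # center value v_i
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_pose_tree(root: int, children: List[int],
                    sim: Callable[[int, int], float]) -> Node:
    """Build the subtree rooted at `root` from its registered children."""
    node = Node(index=root)
    if not children:
        return node
    # Sort the children by similarity to the root silhouette (ascending: assumed).
    ordered = sorted(children, key=lambda c: sim(root, c))
    o = len(ordered)
    split = (o + 1) // 2            # positions <= (o+1)/2 go to the left side
    left_group, right_group = ordered[:split], ordered[split:]
    # The (o+1)/4-th child (1-based, rounded down, at least 1) is the left node.
    left_id = ordered[max(1, (o + 1) // 4) - 1]
    node.left = build_pose_tree(left_id, [c for c in left_group if c != left_id], sim)
    if right_group:
        # The 3(o+1)/4-th child is the right node, kept inside the right group.
        right_pos = min(o, max(split + 1, (3 * (o + 1)) // 4))
        right_id = ordered[right_pos - 1]
        node.right = build_pose_tree(
            right_id, [c for c in right_group if c != right_id], sim)
        # Center value: mean similarity of the last left and first right child.
        node.center = (sim(root, left_group[-1]) + sim(root, right_group[0])) / 2
    return node
```

For example, with $o = 7$ children the left and right nodes are the 2nd and 6th in the sorted sequence, and positions 1 to 4 fall on the left side.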

In the recognition stage, the pose decision tree is used as follows. The user-silhouette is compared to the silhouette of the root node. If the correlation coefficient of the two silhouettes is equal to or greater than $\varepsilon_{\text{PoseMatching}}$, the index of the root node is transmitted to the message generator. Otherwise, the coefficient is compared with the center value of the node: if it is greater than the center value, the user-silhouette is next compared to the right child node; if not, it is compared to the left child node. The comparison of nodes continues in this way until the correlation coefficient of the two silhouettes reaches $\varepsilon_{\text{PoseMatching}}$ or a terminal node is reached. In either case, the index of the node that is ultimately reached is transmitted to the message generator.
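The search can be sketched as a single descent through the tree. The sketch assumes a hypothetical `match(i)` callback that returns the correlation coefficient between the current user-silhouette and expected silhouette $r_i$, and a `threshold` argument standing in for $\varepsilon_{\text{PoseMatching}}$; the `Node` class is a simplified stand-in for the node tuple of Equation (5).

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Node:
    index: int                      # index of the expected silhouette
    center: float                   # center value of the node
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def estimate_pose(root: Node, match: Callable[[int], float],
                  threshold: float) -> Tuple[int, int]:
    """Return (index of the node reached, number of comparisons made)."""
    node = root
    comparisons = 0
    while True:
        score = match(node.index)
        comparisons += 1
        if score >= threshold:
            return node.index, comparisons      # close enough: stop here
        # Not matched: go right if the score exceeds the center value, else left.
        child = node.right if score > node.center else node.left
        if child is None:
            return node.index, comparisons      # terminal node reached
        node = child
```

Because each step discards one side of the tree, the number of comparisons grows with the depth of the tree rather than with the number of expected poses.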

3.2.3. Action, consecutive action network and behavior

In the definition stage, the actions to be executed by a virtual human are defined. An action is the movement by which the virtual human expresses a pose when a pose index is received, as shown in Equation (8).

$$a_j = \langle p_i, d_j, c^j_1, c^j_2, \ldots \rangle$$


where $p_i$ is the pose expressed by the $j$th action, and $d_j$ is the duration of the $j$th action. In addition, $c^j_1$ is the first joint angle required for the virtual human to execute the $j$th action $a_j$. The set of actions is defined as Set $A$:

$$a_j \in A, \quad A = \{ a_1, a_2, \ldots \}$$


Once the actions are defined, a network is defined on top of them in order to select and execute actions consecutively whenever a pose index is received. The behavior planner defines this network in the definition stage as follows. First, start-poses and goal-poses are placed. A start-pose is the pose from which a sequence of consecutive actions begins, and a goal-pose is the targeted pose: the pose of the last action in the generated sequence becomes the goal-pose. Hence, all the poses of Set $P$ are placed on both sides of the network, as shown in Figure 7.

Figure 7

Composition of consecutive action network.


Next, all the actions of Set $A$ are placed, and the sequences of consecutive actions are expressed as a tree by using a directed acyclic graph (DAG). Action nodes, each of which contains one action of Set $A$, are defined first and placed between the start-poses and goal-poses, as shown in Figure 7. After the action nodes are arranged, the DAG connects them. To prevent the tree from containing any loop, an action that appears at more than one point is given a separate action node for each occurrence, like action $a_1$ in Figure 7.

Next, the transition probabilities of all connections in the DAG are defined. The probabilities of the transitions from each action to the other actions are normalized so that they sum to 100.

Finally, each start-pose is connected with the actions that can be executed first after the pose index is received, and each goal-pose is connected with the actions that can be executed last. In Figure 8, start-poses $p_1$ and $p_2$ are connected to their corresponding actions, $a_1$ and $a_2$. Among the goal-poses, $p_1$ is connected with more than one action node, namely $m_1$ and $m_3$, because the actions of both nodes can be executed last. In the case of $p_2$, however, only one of the two action nodes containing $a_2$ is connected. An action node of the network is defined as shown in Equation (10).

Figure 8

Selection of the root node in pose decision tree.


$$m_k = \langle a_j, s_k, g_k, o^k_1, o^k_2, \ldots, q^k_1, q^k_2, \ldots \rangle$$


where $m_k$ is an action node that contains $a_j$; $s_k$ and $g_k$ are the indices of the start-pose and goal-pose connected with action node $m_k$; $o^k_x$ is the $x$th action node to which action node $m_k$ is connected; and $q^k_y$ is the probability of transition to $o^k_y$. The network composed of action nodes is defined as the consecutive action network.
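One possible in-memory representation of an action node, with its transition probabilities normalized to sum to 100, is sketched here. The class name, field names, and the use of string identifiers for action nodes are assumptions for illustration; the source defines only the tuple in Equation (10).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ActionNode:
    action: str                                   # the action a_j held by the node
    start_pose: Optional[int] = None              # s_k, if linked to a start-pose
    goal_pose: Optional[int] = None               # g_k, if linked to a goal-pose
    # Maps each connected action node o_x to its transition probability q_x.
    transitions: Dict[str, float] = field(default_factory=dict)

    def normalize(self) -> None:
        """Rescale the outgoing probabilities so that they sum to 100."""
        total = sum(self.transitions.values())
        if total > 0:
            self.transitions = {
                node_id: 100.0 * weight / total
                for node_id, weight in self.transitions.items()
            }
```

For instance, raw weights of 1 and 3 on two outgoing connections become probabilities of 25 and 75.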

The defined consecutive action network is used when an action is selected in the reconstruction stage. Using the pose index received at time t - 1 as the start-pose index and the pose index received at time t as the goal-pose index, the behavior planner generates consecutive actions as follows. First, among the start-poses, the pose whose index was received at time t - 1 is selected. Next, among the goal-poses, the pose whose index was received at time t is selected. Next, all action nodes and connections that lie on a path from the selected start-pose to the selected goal-pose are selected. Next, for each action node, the transition probabilities to the selected nodes are renormalized so that they sum to 100. Next, among the selected connections, one connection is chosen according to these probabilities; if only one action node is connected, it is selected directly. The selection of connections is repeated until an action node connected with the goal-pose is reached. Finally, the consecutive actions are constructed by connecting the actions of all visited action nodes in order, and they are defined as a behavior. The defined behavior is then executed.
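The selection procedure can be sketched as a random walk over the DAG that is restricted to nodes from which the goal-pose is still reachable. This is a minimal illustration under assumptions: the `ActionNode` class and its field names are hypothetical, a start node is picked uniformly when several qualify (the source does not say how), and renormalization is left to `random.Random.choices`, which treats the weights as relative.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ActionNode:
    action: str
    start_pose: Optional[int] = None
    goal_pose: Optional[int] = None
    transitions: Dict[str, float] = field(default_factory=dict)


def reaches_goal(net: Dict[str, ActionNode], node_id: str, goal: int,
                 memo: Dict[str, bool]) -> bool:
    """True if the node is linked to the goal-pose or can lead to such a node."""
    if node_id not in memo:
        node = net[node_id]
        memo[node_id] = node.goal_pose == goal or any(
            reaches_goal(net, nxt, goal, memo) for nxt in node.transitions)
    return memo[node_id]


def plan_behavior(net: Dict[str, ActionNode], prev_pose: int, curr_pose: int,
                  rng: random.Random) -> List[str]:
    """Return the action sequence from the start-pose (t - 1) to the goal-pose (t)."""
    memo: Dict[str, bool] = {}
    starts = [k for k, n in net.items()
              if n.start_pose == prev_pose and reaches_goal(net, k, curr_pose, memo)]
    if not starts:
        return []                                 # no path between the two poses
    current = rng.choice(starts)
    behavior = [net[current].action]
    while net[current].goal_pose != curr_pose:
        # Keep only connections that stay on a path to the goal-pose.
        options = {k: p for k, p in net[current].transitions.items()
                   if reaches_goal(net, k, curr_pose, memo)}
        ids = list(options)
        # Relative weights make the explicit renormalization to 100 unnecessary.
        current = rng.choices(ids, weights=[options[k] for k in ids])[0]
        behavior.append(net[current].action)
    return behavior
```

Since the network is acyclic, the walk always terminates, and different runs can yield different behaviors for the same pair of poses.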

For example, when pose index 2 is received at both time t - 1 and time t in the network of Figure 7, a behavior is generated as shown in Figure 8. Action nodes m4 and m2 are activated through their connections to the start-pose and to the goal-pose. Two other activated action nodes, m7 and m, lie on the connection from m4 to m6. In the consecutive action network, the action nodes are visited along the connections from m4 in the order mmm6 or mm·m6. Therefore, behaviors with the action order a2a3a2 or a2a a2 are generated and executed.