Robotic manipulation tasks often rely on static cameras for perception,
which can limit flexibility, particularly in scenarios like robotic surgery
and cluttered environments where mounting static cameras is impractical.
Ideally, robots could jointly learn a policy for dynamic viewpoint selection and manipulation.
However, it remains unclear which state-action space is most suitable for this
complex learning process. To enable manipulation with dynamic viewpoints, we
conduct a comparative study of how the choice of state-action space affects the
learning and performance of visuomotor policies that integrate viewpoint
selection with manipulation. Specifically, we examine the
configuration space of the robotic system, the end-effector space with a dual-arm
Inverse Kinematics (IK) solver, and the reduced end-effector space with a look-at
IK solver that optimizes the camera rotation for viewpoint selection. We also assess variants
with different rotation representations. Our results show that state-action
spaces using Euler angles together with the look-at IK solver achieve higher
task success rates than the other spaces. Further analysis suggests that these performance
differences are driven by inherent variations in the high-frequency components
across different state-action spaces and rotation representations.
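To make the look-at idea concrete, the following is a minimal sketch of one way to compute a look-at rotation and express it as Euler angles; it is our own illustration, not the paper's solver. The function name `look_at_rotation`, the +z-forward camera convention, the world-up hint, and the "xyz" Euler convention are all assumptions for this example.

```python
# Minimal look-at sketch: build a rotation whose +z axis points from the
# camera position toward a target, then convert it to Euler angles.
# Conventions here are illustrative assumptions, not the paper's method.
import numpy as np
from scipy.spatial.transform import Rotation as R

def look_at_rotation(camera_pos, target_pos, up=np.array([0.0, 0.0, 1.0])):
    """Rotation matrix whose third column (+z) points from camera to target.

    Assumes the viewing direction is not parallel to the 'up' hint.
    """
    forward = target_pos - camera_pos
    forward = forward / np.linalg.norm(forward)
    # Right axis: perpendicular to the world 'up' hint and the view direction.
    right = np.cross(up, forward)
    right = right / np.linalg.norm(right)
    # True up axis: orthogonal to both forward and right by construction.
    true_up = np.cross(forward, right)
    # Columns are the camera frame axes expressed in the world frame.
    return np.column_stack([right, true_up, forward])

cam = np.array([0.5, -0.3, 0.8])   # hypothetical camera position
tgt = np.array([0.2, 0.1, 0.0])    # hypothetical point to look at
R_wc = look_at_rotation(cam, tgt)
euler_xyz = R.from_matrix(R_wc).as_euler("xyz")  # one Euler convention among many
print(euler_xyz)
```

With such a solver, the policy only needs to output the camera position (and perhaps a gaze target); the rotation is determined by the look-at constraint, which is what makes the end-effector space "reduced."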
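The frequency-domain comparison hinted at in the last sentence could be probed with something like the sketch below, which measures the fraction of spectral energy above a cutoff in recorded action trajectories. The metric, cutoff, sampling rate, and data here are hypothetical; the paper's exact analysis is not reproduced.

```python
# Hypothetical spectral comparison: how much high-frequency energy do action
# trajectories carry in a given state-action space / rotation representation?
import numpy as np

def high_freq_energy(actions, dt=0.05, cutoff_hz=2.0):
    """Fraction of spectral energy above cutoff_hz, averaged over action dims.

    actions: (T, D) array of per-timestep actions in some state-action space.
    """
    T = actions.shape[0]
    freqs = np.fft.rfftfreq(T, d=dt)                       # (T//2+1,)
    spectrum = np.abs(np.fft.rfft(actions, axis=0)) ** 2   # (T//2+1, D)
    total = spectrum.sum(axis=0) + 1e-12                   # avoid divide-by-zero
    high = spectrum[freqs > cutoff_hz].sum(axis=0)
    return float((high / total).mean())

# Example: a smooth trajectory vs. the same trajectory with added jitter.
t = np.linspace(0.0, 5.0, 100)[:, None]
smooth = np.sin(t)                            # low-frequency motion
jittery = smooth + 0.1 * np.random.randn(100, 1)
print(high_freq_energy(smooth), high_freq_energy(jittery))
```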