[OpenAI Spinning Up] Introduction to RL / Part 3: Intro to Policy Optimization

📌 Programmers AI Dev Course (6th cohort), reinforcement learning study group

OpenAI Spinning Up - Introduction to RL

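This section covers the mathematical foundations of policy optimization algorithms and three key results from the theory of policy gradients:

the simplest equation describing the gradient of policy performance with respect to the policy parameters,
a rule that lets us drop useless terms from that expression,
a rule that lets us add useful terms to that expression.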

Deriving the Simplest Policy Gradient

Consider a stochastic, parameterized policy $\pi_\theta$. Our goal is to maximize the expected return $J(\pi_\theta) = E_{\tau \sim \pi_\theta}[R(\tau)]$. The gradient of policy performance, $\nabla_{\theta}J(\pi_{\theta})$, is called the policy gradient, and algorithms that optimize the policy with it are called policy gradient algorithms. For this derivation we take $R(\tau)$ to be the finite-horizon undiscounted return (the derivation for the infinite-horizon discounted return is almost identical).

We optimize the policy by gradient ascent:

$\theta_{k+1} = \theta_k + \alpha\nabla_\theta J(\pi_\theta)|_{\theta_k}$
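The update rule can be illustrated on a toy objective. The sketch below is only an assumption-laden illustration: `toy_grad` is a hypothetical stand-in for $\nabla_\theta J(\pi_\theta)$ (the gradient of a simple concave function with its maximum at $(1, -2)$), not a real policy gradient, and the step size and iteration count are arbitrary.

```python
def gradient_ascent(grad_fn, theta0, alpha=0.1, steps=100):
    """Repeat theta_{k+1} = theta_k + alpha * grad_fn(theta_k)."""
    theta = list(theta0)
    for _ in range(steps):
        g = grad_fn(theta)
        theta = [t + alpha * gi for t, gi in zip(theta, g)]
    return theta


def toy_grad(theta):
    """Gradient of the hypothetical objective J = -(theta_0 - 1)^2 - (theta_1 + 2)^2."""
    return [-2.0 * (theta[0] - 1.0), -2.0 * (theta[1] + 2.0)]


theta_star = gradient_ascent(toy_grad, [0.0, 0.0])
print(theta_star)  # close to the maximizer (1, -2)
```

In a policy gradient algorithm the exact gradient is not available, so `grad_fn` would be replaced by a sample estimate built from collected trajectories.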

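To actually use this update rule, we need an expression for the policy gradient that can be computed numerically. This takes two steps.

1) derive the analytical gradient of policy performance, which turns out to have the form of an expected value, and 2) form a sample estimate of that expected value from data collected over a finite number of agent-environment interaction steps.

Following these steps, let us derive the simplest form of the policy gradient. A few facts are needed first.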

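Probability of a Trajectory

The probability of a trajectory $\tau = (s_0, a_0, ..., s_{T+1})$, given that actions come from $\pi_\theta$, is

$P(\tau|\theta) = \rho_0(s_0)\prod_{t=0}^{T} P(s_{t+1}|s_t,a_t)\,\pi_\theta(a_t|s_t)$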
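As a small worked example, the trajectory probability can be computed directly for a tabular problem. Everything below is a hypothetical setup chosen for illustration: a 2-state, 2-action MDP with made-up start distribution `rho0`, transition table `P[s][a][s_next]`, and policy table `pi[s][a]`.

```python
def trajectory_probability(rho0, P, pi, states, actions):
    """P(tau|theta) = rho0(s_0) * prod_t P(s_{t+1}|s_t,a_t) * pi(a_t|s_t).

    `states` holds s_0..s_{T+1} and `actions` holds a_0..a_T.
    """
    prob = rho0[states[0]]
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        prob *= P[s][a][s_next] * pi[s][a]
    return prob


# Hypothetical toy MDP (all numbers are made up).
rho0 = [0.5, 0.5]
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]
pi = [[0.7, 0.3], [0.4, 0.6]]

p = trajectory_probability(rho0, P, pi, states=[0, 0, 1], actions=[0, 1])
print(p)  # 0.5 * (0.9 * 0.7) * (0.8 * 0.3)
```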
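The Log-Derivative Trick

The derivative of $\log x$ with respect to $x$ is $1/x$. Rearranging and combining this with the chain rule, applied to the trajectory probability $P(\tau|\theta)$, gives

$\nabla_\theta P(\tau|\theta) = P(\tau|\theta)\,\nabla_\theta \log P(\tau|\theta)$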
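The identity $\nabla_\theta P = P\,\nabla_\theta \log P$ can be checked numerically. The sketch below assumes, purely for illustration, that the distribution is a softmax over three logits `theta`; it compares a finite-difference gradient of the probability with the probability times the analytic gradient of the log-probability.

```python
import math


def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_prob(theta, x):
    """Analytic gradient of log softmax(theta)[x]: onehot(x) - softmax(theta)."""
    p = softmax(theta)
    return [(1.0 if i == x else 0.0) - p[i] for i in range(len(theta))]


def finite_diff_grad_prob(theta, x, eps=1e-6):
    """Central finite-difference gradient of softmax(theta)[x]."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        dn = list(theta)
        up[i] += eps
        dn[i] -= eps
        grad.append((softmax(up)[x] - softmax(dn)[x]) / (2 * eps))
    return grad


theta = [0.3, -1.2, 0.8]  # arbitrary logits
x = 2
lhs = finite_diff_grad_prob(theta, x)                               # grad P
rhs = [softmax(theta)[x] * g for g in grad_log_prob(theta, x)]      # P * grad log P
max_gap = max(abs(a - b) for a, b in zip(lhs, rhs))
print(max_gap)  # small, limited only by finite-difference error
```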
Log-Probability of a Trajectory

Taking the log of the trajectory probability turns the product into a sum:

$\log P(\tau|\theta) = \log \rho_0(s_0) + \sum_{t=0}^{T}\left(\log P(s_{t+1}|s_t,a_t) + \log \pi_\theta(a_t|s_t)\right)$
Gradients of Environment Functions

The environment does not depend on $\theta$, so the gradients of $\rho_0(s_0)$, $P(s_{t+1}|s_t,a_t)$, and $R(\tau)$ with respect to $\theta$ are all zero.
Grad-Log-Prob of a Trajectory

Differentiating the log-probability of a trajectory and dropping the environment terms, whose gradients are zero, leaves

$\nabla_\theta \log P(\tau|\theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)$
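To see that the environment terms really drop out, the sketch below compares the sum of per-step policy grad-log-probs against a finite-difference gradient of the full log-probability, environment terms included. The tabular softmax policy (`theta[s][a]` as logits) and the toy MDP numbers are assumptions made for this illustration.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def log_traj_prob(theta, rho0, P, states, actions):
    """log P(tau|theta), including the environment terms rho0 and P."""
    total = math.log(rho0[states[0]])
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        total += math.log(P[s][a][s_next]) + math.log(softmax(theta[s])[a])
    return total


def grad_log_traj_prob(theta, states, actions):
    """sum_t grad_theta log pi_theta(a_t|s_t); no environment terms appear."""
    grad = [[0.0] * len(row) for row in theta]
    for t, a in enumerate(actions):
        s = states[t]
        p = softmax(theta[s])
        for b in range(len(p)):
            grad[s][b] += (1.0 if b == a else 0.0) - p[b]
    return grad


# Hypothetical toy MDP and trajectory.
rho0 = [0.5, 0.5]
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.3, 0.7]]]
theta = [[0.1, -0.4], [0.7, 0.2]]
states, actions = [0, 0, 1, 1], [0, 1, 1]

analytic = grad_log_traj_prob(theta, states, actions)
eps = 1e-6
max_gap = 0.0
for s in range(2):
    for a in range(2):
        up = [row[:] for row in theta]
        dn = [row[:] for row in theta]
        up[s][a] += eps
        dn[s][a] -= eps
        numeric = (log_traj_prob(up, rho0, P, states, actions)
                   - log_traj_prob(dn, rho0, P, states, actions)) / (2 * eps)
        max_gap = max(max_gap, abs(numeric - analytic[s][a]))
print(max_gap)  # small: the two gradients agree
```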

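Putting these facts together, the full derivation of $\nabla_\theta J(\pi_{\theta})$ is as follows.

$\begin{aligned}
\nabla_\theta J(\pi_\theta) &= \nabla_\theta E_{\tau \sim \pi_\theta}[R(\tau)] \\
&= \nabla_\theta \int_\tau P(\tau|\theta) R(\tau) \\
&= \int_\tau \nabla_\theta P(\tau|\theta) R(\tau) \\
&= \int_\tau P(\tau|\theta)\, \nabla_\theta \log P(\tau|\theta) R(\tau) \\
&= E_{\tau \sim \pi_\theta}\left[\nabla_\theta \log P(\tau|\theta) R(\tau)\right] \\
\therefore \nabla_\theta J(\pi_\theta) &= E_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau)\right]
\end{aligned}$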

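This expression is an expectation, so it can be estimated with a sample mean. If we collect a set of trajectories $\mathcal{D} = \{\tau_i\}_{i = 1,...,N}$ by letting the agent act in the environment with the policy $\pi_{\theta}$, the policy gradient can be estimated with

$\hat{g} = \frac{1}{|\mathcal{D}|}\sum_{\tau \in \mathcal{D}}\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau)$

where $|\mathcal{D}|$ is the number of trajectories in $\mathcal{D}$ (here, $N$).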

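This is the simplest form of the policy gradient expression we set out to find.

If the policy is represented in a way that lets us compute $\nabla_\theta \log\pi_{\theta}(a|s)$, and we can run the policy in the environment to collect a set of trajectories, then we can compute the policy gradient estimate and take an update step.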
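The following is a minimal sketch of this loop in plain Python, not the Spinning Up implementation. It assumes a hypothetical toy environment (two states, two actions, reward 1 when the action index matches the state index, uniformly random next state, horizon 3) and a tabular softmax policy with logits `theta[s][a]`. Batch size, step size, and iteration count are arbitrary choices.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def sample_trajectory(theta, rng, horizon=3):
    """Roll out the softmax policy in the hypothetical toy environment."""
    states, actions, rewards = [], [], []
    s = rng.randrange(2)
    for _ in range(horizon):
        p = softmax(theta[s])
        a = 0 if rng.random() < p[0] else 1
        states.append(s)
        actions.append(a)
        rewards.append(1.0 if a == s else 0.0)
        s = rng.randrange(2)  # next state does not depend on the action
    return states, actions, rewards


def policy_gradient_estimate(theta, trajectories):
    """g_hat = (1/|D|) * sum_tau sum_t grad log pi(a_t|s_t) * R(tau)."""
    g = [[0.0, 0.0], [0.0, 0.0]]
    for states, actions, rewards in trajectories:
        ret = sum(rewards)  # R(tau): finite-horizon undiscounted return
        for s, a in zip(states, actions):
            p = softmax(theta[s])
            for b in range(2):
                g[s][b] += ((1.0 if b == a else 0.0) - p[b]) * ret
    n = len(trajectories)
    return [[v / n for v in row] for row in g]


rng = random.Random(0)
theta = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(100):
    batch = [sample_trajectory(theta, rng) for _ in range(64)]
    g_hat = policy_gradient_estimate(theta, batch)
    # Gradient ascent step with step size 0.5.
    theta = [[t + 0.5 * g for t, g in zip(t_row, g_row)]
             for t_row, g_row in zip(theta, g_hat)]

pi = [softmax(row) for row in theta]
print(pi)  # the policy should come to prefer action s in state s
```

In a neural-network setting the analytic `grad log pi` above would typically come from automatic differentiation instead of a hand-written formula.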

Expected Grad-Log-Prob Lemma

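The Expected Grad-Log-Prob (EGLP) lemma is an intermediate result used extensively throughout the theory of policy gradients. Suppose $P_\theta$ is a parameterized probability distribution over a random variable $x$. Then

$\underset{x\sim P_{\theta}}{E}[\nabla_{\theta}\log P_\theta(x)]=0$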
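The lemma follows from the fact that a probability distribution sums (or integrates) to 1, so the gradient of that total is zero. For a small discrete distribution the expectation can be computed exactly as a check. The softmax parameterization below is an assumption chosen for illustration.

```python
import math


def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_prob(theta, x):
    """Gradient of log P_theta(x) for a softmax distribution: onehot(x) - P_theta."""
    p = softmax(theta)
    return [(1.0 if i == x else 0.0) - p[i] for i in range(len(theta))]


theta = [0.5, -0.3, 1.1, 0.0]  # arbitrary parameters
p = softmax(theta)

# E_{x ~ P_theta}[grad log P_theta(x)], computed exactly by summing over x.
expected_grad = [
    sum(p[x] * grad_log_prob(theta, x)[i] for x in range(len(theta)))
    for i in range(len(theta))
]
print(expected_grad)  # every component is zero up to rounding error
```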

Don’t Let the Past Distract You

The policy gradient expression we obtained earlier is

$\nabla_\theta J(\pi_\theta) = E_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau)\right]$

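Taking a step with this gradient pushes up the log-probability of each action in proportion to $R(\tau)$, the sum of all rewards obtained over the whole trajectory. But an agent should reinforce an action only on the basis of its consequences. Rewards obtained before an action was taken have no bearing on how good that action was; only the rewards that come after it matter. Replacing $R(\tau)$ with the rewards that follow each action addresses this, and the policy gradient can be written as

$\nabla_\theta J(\pi_\theta) = E_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \sum_{t'=t}^{T} R(s_{t'}, a_{t'}, s_{t'+1})\right]$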

Because each action is weighted by the sum of rewards obtained after that point in the trajectory, this form is called the reward-to-go policy gradient; the expression depends on the reward-to-go from each state-action pair. The reward-to-go from time $t$ is defined as

$\hat{R}_t \doteq \sum_{t'=t}^{T} R(s_{t'}, a_{t'}, s_{t'+1})$
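Given the list of rewards from one trajectory, the reward-to-go at every time step can be computed in a single backward pass. This is a minimal sketch for the undiscounted case; the function name and the example rewards are illustrative.

```python
def reward_to_go(rewards):
    """Return [R_hat_0, ..., R_hat_T], where R_hat_t is the sum of rewards from step t onward."""
    rtgs = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtgs[t] = running
    return rtgs


rtgs = reward_to_go([1.0, 0.0, 2.0, 3.0])
print(rtgs)  # [6.0, 5.0, 5.0, 3.0]
```

The first entry equals the full return $R(\tau)$, and later entries ignore the rewards collected before their time step.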

Baselines in Policy Gradients

An immediate consequence of the EGLP lemma is that, for any function $b$ that depends only on the state,

$\underset{a_t \sim \pi_\theta}{E}\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\, b(s_t)\right] = 0$

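This lets us add or subtract any number of terms of this form without changing the policy gradient in expectation, so the expression can be rewritten as

$\nabla_\theta J(\pi_\theta) = E_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \left(\sum_{t'=t}^{T} R(s_{t'}, a_{t'}, s_{t'+1}) - b(s_t)\right)\right]$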

Any function $b$ used in this way is called a baseline.

The most common choice of baseline is the on-policy value function $V^\pi(s_t)$. Empirically, choosing $b(s_t) = V^\pi(s_t)$ reduces the variance of the sample estimate of the policy gradient.
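Both claims can be illustrated on a problem small enough to enumerate exactly. The sketch below assumes a hypothetical one-state, three-action problem with made-up rewards and a softmax policy. It computes the exact mean and total variance of the single-sample estimator $\nabla_\theta \log \pi_\theta(a)\,(r_a - b)$, once with $b = 0$ and once with $b$ equal to the value of the state.

```python
import math


def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def estimator_stats(theta, rewards, baseline):
    """Exact mean and total variance of g(a) = grad log pi(a) * (r_a - b), a ~ pi."""
    p = softmax(theta)
    n = len(theta)
    samples = [
        [((1.0 if i == a else 0.0) - p[i]) * (rewards[a] - baseline) for i in range(n)]
        for a in range(n)
    ]
    mean = [sum(p[a] * samples[a][i] for a in range(n)) for i in range(n)]
    var = sum(
        p[a] * sum((samples[a][i] - mean[i]) ** 2 for i in range(n))
        for a in range(n)
    )
    return mean, var


theta = [0.2, -0.1, 0.4]       # arbitrary policy logits
rewards = [10.0, 11.0, 12.0]   # made-up reward per action
p = softmax(theta)
value = sum(pa * r for pa, r in zip(p, rewards))  # V^pi of the single state

mean_plain, var_plain = estimator_stats(theta, rewards, baseline=0.0)
mean_base, var_base = estimator_stats(theta, rewards, baseline=value)
print(mean_plain, mean_base)  # identical up to rounding: the baseline adds no bias
print(var_plain, var_base)    # the variance is much smaller with the baseline here
```

This only demonstrates the effect on one toy problem; it is not a proof that the value function is the variance-minimizing baseline.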

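💡 In practice $V^\pi(s_t)$ cannot be computed exactly, so it is approximated by $V_\phi(s_t)$. This is usually a neural network, updated concurrently with the policy.
In most implementations of policy optimization algorithms, the simplest way to learn $V_{\phi}$ is to minimize a mean-squared-error (MSE) objective:

$\phi_k = \arg\min_{\phi} \underset{s_t, \hat{R}_t \sim \pi_k}{E}\left[\left(V_\phi(s_t) - \hat{R}_t\right)^2\right]$

where $\pi_k$ is the policy at epoch $k$.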
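As a rough illustration of this regression, the sketch below fits a tabular value function (one parameter `phi[s]` per state, standing in for a neural network) by gradient descent on the mean squared error against reward-to-go targets. The data, learning rate, and step count are made up for the example.

```python
def fit_value_function(data, num_states, lr=0.5, steps=200):
    """Minimize (1/n) * sum_i (phi[s_i] - R_hat_i)^2 by gradient descent."""
    phi = [0.0] * num_states
    n = len(data)
    for _ in range(steps):
        grad = [0.0] * num_states
        for s, rtg in data:
            grad[s] += 2.0 * (phi[s] - rtg) / n
        phi = [v - lr * g for v, g in zip(phi, grad)]
    return phi


# Hypothetical (state, reward-to-go) pairs collected with the current policy.
data = [(0, 3.0), (0, 5.0), (1, 1.0), (1, 2.0), (0, 4.0)]
phi = fit_value_function(data, num_states=2)
print(phi)  # approaches the per-state mean of the targets: about [4.0, 1.5]
```

In the tabular case the MSE minimizer is simply the average reward-to-go observed from each state, which is what the gradient descent converges to.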

Other Forms of the Policy Gradient

So far we have seen that the policy gradient has the general form

$\nabla_\theta J(\pi_\theta) = E_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\, \Phi_t\right]$

where $\Phi_t$ can be any of the following:

$\Phi_t = R(\tau)$

$\Phi_t = \sum_{t'=t}^{T} R(s_{t'}, a_{t'}, s_{t'+1})$

$\Phi_t = \sum_{t'=t}^{T} R(s_{t'}, a_{t'}, s_{t'+1}) - b(s_t)$

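In addition, $\Phi_t$ can be the on-policy action-value function, $\Phi_t = Q^{\pi_\theta}(s_t,a_t)$, or the advantage function $A^{\pi_\theta}(s_t,a_t)=Q^{\pi_\theta}(s_t,a_t)-V^{\pi_\theta}(s_t)$, giving $\Phi_t = A^{\pi_\theta}(s_t,a_t)$.

All of these choices of $\Phi_t$ give the same expected value for the policy gradient, although the variance differs.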
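The sample-based choices of $\Phi_t$ (total return, reward-to-go, and reward-to-go minus a baseline) can be computed side by side from one trajectory. In this sketch the function name, the rewards, and the baseline values (standing in for $b(s_t)$) are all hypothetical.

```python
def phi_choices(rewards, baseline_values):
    """Per-time-step weights Phi_t for three forms of the policy gradient."""
    total = sum(rewards)
    rtg = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    rtg.reverse()
    return {
        "total_return": [total] * len(rewards),
        "reward_to_go": rtg,
        "reward_to_go_minus_baseline": [g - b for g, b in zip(rtg, baseline_values)],
    }


weights = phi_choices(rewards=[1.0, 0.0, 2.0], baseline_values=[2.5, 1.5, 1.0])
print(weights["total_return"])                 # [3.0, 3.0, 3.0]
print(weights["reward_to_go"])                 # [3.0, 2.0, 2.0]
print(weights["reward_to_go_minus_baseline"])  # [0.5, 0.5, 1.0]
```

Each list would multiply the corresponding $\nabla_\theta \log \pi_\theta(a_t|s_t)$ terms in a sample estimate of the gradient.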

A graduate student's study notes