[Translation] Web Service Efficiency at Instagram with Python

f:id:kikuchi1201:20160622104353p:plain

Original: Web Service Efficiency at Instagram with Python

Web Service Efficiency at Instagram with Python
Afterword

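This text was run through machine translation and then loosely reworked by me so that the grammar holds together as far as possible, so many passages are hard to read. Terms whose meaning I was unsure of are all given in parentheses. If you spot mistakes or have corrections, please let me know in the comments.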

Terms I was unsure about

CPU regression (rendered as 「CPUの回帰」)
confidence intervals (rendered as 「信頼区間」)
battle regression (rendered as 「振り返り」, "looking back")

Web Service Efficiency at Instagram with Python

Instagram currently features the world’s largest deployment of the Django web framework, which is written entirely in Python.

Instagram currently runs the world's largest deployment of Django, a web framework written entirely in Python.

We initially chose to use Python because of its reputation for simplicity and practicality, which aligns well with our philosophy of “do the simple thing first.” But simplicity can come with a tradeoff: efficiency.

We first chose Python for its reputation for simplicity and practicality, which fits our philosophy of "do the simple thing first."
But simplicity comes with a tradeoff: efficiency.

Instagram has doubled in size over the last two years and recently crossed 500 million users, so there is a strong need to maximize web service efficiency so that our platform can continue to scale smoothly.

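Instagram has doubled in size over the past two years and recently passed 500 million users, so we need to maximize the efficiency of our web service so that the platform can keep scaling smoothly.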

In the past year we’ve made our efficiency program a priority, and over the last six months we’ve been able to maintain our user growth without adding new capacity to our Django tiers.

Over the past year we have made our efficiency program a priority, and for the last six months we have sustained user growth without adding capacity to our Django tiers.

In this post, we’ll share some of the tools we built and how we use them to optimize our daily deployment flow.

In this post we share some of the tools we built and how we use them to optimize our daily deployment flow.

Why Efficiency?

Instagram, like all software, is limited by physical constraints like servers and datacenter power.

Like all software, Instagram is limited by physical constraints such as servers and datacenter power.

With these constraints in mind, there are two main goals we want to achieve with our efficiency program:

With these constraints in mind, our efficiency program has two main goals:

1. Instagram should be able to serve traffic normally with continuous code rollouts in the case of lost capacity in one data center region, due to natural disaster, regional network issues, etc.

1. Even if capacity is lost in one data center region because of a natural disaster, regional network problems, or the like, Instagram should keep serving traffic normally while code rollouts continue.

2. Instagram should be able to freely roll out new products and features without being blocked by capacity.

2. Instagram should be free to roll out new products and features without capacity getting in the way.

To meet these goals, we realized we needed to persistently monitor our system and battle regression.

To meet these goals, we realized that we needed to monitor our system continuously and fight regressions (battle regression).

Defining Efficiency

Web services are usually bottlenecked by available CPU time on each server.

Web services are usually bottlenecked by the CPU time available on each server.

Efficiency in this context means using the same amount of CPU resources to do more work, a.k.a, processing more user requests per second (RPS).

Efficiency here means doing more work with the same amount of CPU resources, that is, processing more user requests per second (RPS).

As we look for ways to optimize, our first challenge is trying to quantify our current efficiency.

As we looked for ways to optimize, our first challenge was quantifying our current efficiency.

Up to this point, we were approximating efficiency using ‘Average CPU time per requests,’ but there were two inherent limitations to using this metric:

Until then, we had approximated efficiency with "average CPU time per request," but this metric (measurement standard) had two inherent limitations:

Diversity of devices. Using CPU time for measuring CPU resources is not ideal because it is affected by both CPU models and CPU loads.

The first is the diversity of devices. CPU time is not an ideal measure of CPU resources because it depends on both the CPU model and the CPU load.

Request impacts data. Measuring CPU resource per request is not ideal because adding and removing light or heavy requests would also impact the efficiency metric using the per-requests measurement.

The second is that the request mix affects the data. Measuring CPU resources per request is not ideal because adding or removing light or heavy requests also shifts a per-request efficiency metric.

Compared to CPU time, CPU instruction is a better metric, as it reports the same numbers regardless of CPU models and CPU loads for the same request.

Compared with CPU time, the CPU instruction count is a better metric: for the same request it reports the same number regardless of CPU model or CPU load.

Instead of linking all our data to each user request, we chose to use a ‘per active user’ metric.

Rather than tying all our data to individual user requests, we chose a "per active user" metric.

We eventually landed on measuring efficiency by using ‘CPU instruction per active user during peak minute.’ With our new metric established, our next step was to learn more about our regressions by profiling Django.

We eventually settled on measuring efficiency as "CPU instructions per active user during the peak minute." With the new metric in place, our next step was to learn more about our regressions by profiling Django.
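The metric above can be sketched in a few lines of Python. This is only an illustration under assumptions: the article does not say how the peak minute is chosen or how the data is stored, so the `(minute, user_id, cpu_instructions)` tuple layout and the rule "peak minute = minute with the most active users" are hypothetical.

```python
from collections import defaultdict


def cpu_instructions_per_active_user_at_peak(samples):
    """Compute CPU instructions per active user during the peak minute.

    samples: iterable of (minute, user_id, cpu_instructions) tuples
    (a hypothetical record layout, not Instagram's).
    """
    instructions = defaultdict(int)   # minute -> total CPU instructions
    active_users = defaultdict(set)   # minute -> distinct user ids seen

    for minute, user_id, count in samples:
        instructions[minute] += count
        active_users[minute].add(user_id)

    # Assumption: the peak minute is the one with the most active users.
    peak = max(active_users, key=lambda m: len(active_users[m]))
    return instructions[peak] / len(active_users[peak])
```

Because the denominator is users rather than requests, adding or removing light requests changes the total only by the work those requests actually cost.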

Profiling the Django Service

There are two major questions we want to answer by profiling our Django web service:

By profiling our Django web service we want to answer two main questions:

Does a CPU regression happen? What causes the CPU regression and how do we fix it?

Has a CPU regression happened?
What caused the CPU regression, and how do we fix it?

To answer the first question, we need to track the CPU-instruction-per-active-user metric.

To answer the first question, we need to track the CPU-instructions-per-active-user metric.

If this metric increases, we know a CPU regression has occurred.

If this metric goes up, we know a CPU regression has occurred.

The tool we built for this purpose is called Dynostats.

The tool we built for this purpose is called Dynostats.

Dynostats utilizes Django middleware to sample user requests by a certain rate, recording key efficiency and performance metrics such as the total CPU instructions, end to end requests latency, time spent on accessing memcache and database services, etc.

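Dynostats uses Django middleware to sample user requests at a certain rate and records key efficiency and performance metrics such as total CPU instructions, end-to-end request latency, and time spent accessing memcache (a general-purpose distributed memory caching system) and database services.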

On the other hand, each request has multiple metadata that we can use for aggregation, such as the endpoint name, the HTTP return code of the request, the server name that serves this request, and the latest commit hash on the request.

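Each request also carries several pieces of metadata that we can aggregate on: the endpoint name, the HTTP return code of the request, the name of the server that served it, and the latest commit hash on the request.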
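As a rough sketch of what request-sampling middleware of this kind could look like: the class below follows the shape of Django's callable middleware (constructed with `get_response`, called with a request) but does not import Django, so it runs on its own. Dynostats itself is not public; the class name, the `sample_rate`, `read_counter`, and `sink` parameters, and the record fields are all assumptions made for illustration.

```python
import random


class SamplingStatsMiddleware:
    """Django-style middleware that records metrics for a sample of requests."""

    def __init__(self, get_response, read_counter, sink,
                 sample_rate=0.01, rng=random.random):
        self.get_response = get_response  # next handler in the chain
        self.read_counter = read_counter  # returns a cumulative CPU instruction count
        self.sink = sink                  # callable that receives one record (dict)
        self.sample_rate = sample_rate    # fraction of requests to record
        self.rng = rng                    # injectable for deterministic tests

    def __call__(self, request):
        # Unsampled requests pass straight through with no measurement overhead.
        if self.rng() >= self.sample_rate:
            return self.get_response(request)

        start = self.read_counter()
        response = self.get_response(request)
        self.sink({
            # Metadata used later for aggregation.
            "endpoint": getattr(request, "path", "unknown"),
            "status": getattr(response, "status_code", None),
            # The efficiency metric itself.
            "cpu_instructions": self.read_counter() - start,
        })
        return response
```

Keeping the metric and the metadata in one record is what allows the same sample to be grouped by endpoint, status code, server, or commit afterwards.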

Having two aspects for a single request record is especially powerful because we can slice and dice on various dimensions that help us narrow down the cause of any CPU regression.

Having both aspects on a single request record is especially powerful, because we can slice and dice along various dimensions to narrow down the cause of any CPU regression.

For example, we can aggregate all requests by their endpoint names as shown in the time series chart below, where it is very obvious to spot if any regression happens on a specific endpoint.

For example, we can aggregate all requests by endpoint name, as in the time series chart below, where a regression on a specific endpoint is easy to spot.

f:id:kikuchi1201:20160622094755p:plain
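The "slice and dice" idea can be illustrated with a small grouping helper. This is a sketch only: the record fields (`endpoint`, `commit`, `cpu_instructions`) are hypothetical names, and the article does not describe the real storage or query layer.

```python
from collections import defaultdict


def mean_instructions_by(records, dimension):
    """Average CPU instructions per request, grouped by one metadata field."""
    totals = defaultdict(lambda: [0, 0])  # value -> [instruction sum, request count]
    for record in records:
        bucket = totals[record[dimension]]
        bucket[0] += record["cpu_instructions"]
        bucket[1] += 1
    return {value: total / count for value, (total, count) in totals.items()}
```

Calling it with `"endpoint"` gives the per-endpoint view shown in the chart; calling it with `"commit"` on the same records would point at the change that introduced the cost.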

CPU instructions matter for measuring efficiency — and they’re also the hardest to get.

CPU instructions matter for measuring efficiency, and they are also the hardest to obtain.

Python does not have common libraries that support direct access to the CPU hardware counters (CPU hardware counters are the CPU registers that can be programmed to measure performance metrics, such as CPU instructions).

Python has no common library that gives direct access to the CPU hardware counters (CPU registers that can be programmed to measure performance metrics such as CPU instructions).

Linux kernel, on the other hand, provides the perf_event_open system call.

The Linux kernel, however, offers the perf_event_open system call.

Bridging through Python ctypes enables us to call the syscall function in standard C library, which also provides C compatible data types for programming the hardware counters and reading data from them.

Bridging through Python's ctypes lets us call the syscall function in the standard C library; ctypes also provides C-compatible data types for programming the hardware counters and reading data from them.
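A minimal sketch of that ctypes bridge follows. It is not Instagram's code. The assumptions are: Linux only; syscall numbers 298 (x86_64) and 241 (aarch64); only the first 64 bytes of `struct perf_event_attr` are declared; and the kernel's perf settings allow an unprivileged process to count its own user-space instructions. Where perf events are unavailable (many containers and VMs), `open_instruction_counter` raises `OSError`.

```python
import ctypes
import os
import platform
import struct
import sys

PERF_TYPE_HARDWARE = 0
PERF_COUNT_HW_INSTRUCTIONS = 1
# perf_event_open syscall numbers for the architectures this sketch knows about.
SYSCALL_NUMBERS = {"x86_64": 298, "aarch64": 241}
# Bit positions in the perf_event_attr flag bitfield.
FLAG_EXCLUDE_KERNEL = 1 << 5
FLAG_EXCLUDE_HV = 1 << 6


class PerfEventAttr(ctypes.Structure):
    """First 64 bytes of struct perf_event_attr (the original published layout)."""
    _fields_ = [
        ("type", ctypes.c_uint32),
        ("size", ctypes.c_uint32),
        ("config", ctypes.c_uint64),
        ("sample_period", ctypes.c_uint64),
        ("sample_type", ctypes.c_uint64),
        ("read_format", ctypes.c_uint64),
        ("flags", ctypes.c_uint64),
        ("wakeup_events", ctypes.c_uint32),
        ("bp_type", ctypes.c_uint32),
        ("config1", ctypes.c_uint64),
    ]


def open_instruction_counter():
    """Open a counter of user-space instructions retired by this process."""
    number = SYSCALL_NUMBERS.get(platform.machine())
    if not sys.platform.startswith("linux") or number is None:
        raise OSError("perf_event_open is not supported by this sketch here")

    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long

    attr = PerfEventAttr()
    attr.type = PERF_TYPE_HARDWARE
    attr.size = ctypes.sizeof(PerfEventAttr)
    attr.config = PERF_COUNT_HW_INSTRUCTIONS
    attr.flags = FLAG_EXCLUDE_KERNEL | FLAG_EXCLUDE_HV  # count user space only

    # Arguments: attr, pid=0 (this process), cpu=-1 (any), group_fd=-1, flags=0.
    fd = libc.syscall(ctypes.c_long(number), ctypes.byref(attr), ctypes.c_long(0),
                      ctypes.c_long(-1), ctypes.c_long(-1), ctypes.c_long(0))
    if fd < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return fd


def read_counter(fd):
    """Read the current 64-bit counter value from the perf event descriptor."""
    return struct.unpack("Q", os.read(fd, 8))[0]
```

Reading the counter before and after a piece of work and subtracting gives the instructions that work retired; close the descriptor with `os.close(fd)` when done.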

With Dynostats, we can already find CPU regressions and dig into the cause of the CPU regression, such as which endpoint gets impacted most, who committed the changes that actually cause the CPU regression, etc.

With Dynostats we can already find CPU regressions and dig into their causes, such as which endpoint is hit hardest and who committed the change that actually caused the regression.

However, when a developer is notified that their certain changes cause CPU regression, they usually have a hard time finding the problem.

However, a developer who is told that one of their changes caused a CPU regression usually has a hard time finding the problem.

If it was obvious, the regression probably wouldn’t have been committed in the first place!

If it had been obvious, the regression probably would not have been committed in the first place! (a joke?)

That’s why we needed a Python profiler that the developer can use to find the root cause of the regression (once Dynostats identifies it).

That is why we needed a Python profiler that developers can use to find the root cause of a regression once Dynostats has identified it.

Instead of starting from scratch, we decided to make slight alterations to cProfile, a readily available Python profiler.

Rather than starting from scratch, we decided to make small changes to cProfile, a readily available Python profiler.

The cProfile module normally provides a set of statistics describing how long and how often various parts of a program were executed.

The cProfile module normally provides a set of statistics describing how long and how often the various parts of a program ran.

Instead of measuring in time, we took cProfile and replaced the timer with a CPU instruction counter that reads from hardware counters.

Instead of measuring time, we took cProfile and replaced its timer with a CPU instruction counter that reads from the hardware counters.
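The standard library already allows the timer to be swapped: `cProfile.Profile` accepts a `timer` callable and a `timeunit`. The sketch below shows that mechanism only; it is not the modified cProfile the article describes. `read_instruction_count` is a hypothetical stand-in that returns a fake, steadily increasing number so the example runs anywhere; a real setup would read a hardware counter instead.

```python
import cProfile
import itertools
import pstats

# Stand-in for a hardware counter: increases by 1000 on every read (fake data).
_fake_counter = itertools.count(step=1000)


def read_instruction_count():
    """Hypothetical counter source; replace with a real hardware counter read."""
    return next(_fake_counter)


def profile_by_instructions(func, *args, **kwargs):
    """Profile func with the counter in place of the clock.

    Returns {function name: cumulative counter units}. Names from different
    files that collide are merged, which is acceptable for a sketch.
    """
    # timeunit=1.0 tells cProfile the integer timer values are whole units.
    profiler = cProfile.Profile(timer=read_instruction_count, timeunit=1.0)
    profiler.runcall(func, *args, **kwargs)

    costs = {}
    stats = pstats.Stats(profiler).stats
    for (_file, _line, name), (_cc, _nc, _own, cumulative, _callers) in stats.items():
        costs[name] = costs.get(name, 0) + cumulative
    return costs
```

With a real counter behind `read_instruction_count`, the "time" columns of the usual cProfile output become instruction counts, and the rest of the tooling (`pstats`, sorting, callers) keeps working unchanged.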

The data is created at the end of the sampled requests and sent to some data pipelines.

The data is produced at the end of each sampled request and sent to data pipelines.

We also send metadata similar to what we have in Dynostats, such as server name, cluster, region, endpoint name, etc.

We also send metadata similar to what Dynostats has, such as server name, cluster, region, and endpoint name.

On the other side of the data pipeline, we created a tailer to consume the data.

On the other end of the data pipeline, we built a tailer to consume the data.

The main functionality of the tailer is to parse the cProfile stats data and create entities that represent Python function-level CPU instructions.

The tailer's main job is to parse the cProfile stats data and create entities that represent CPU instructions at the level of Python functions.

By doing so, we can aggregate CPU instructions by Python functions, making it easier to tell which functions contribute to CPU regression.

That lets us aggregate CPU instructions by Python function, which makes it easier to tell which functions contribute to a CPU regression.
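A sketch of the aggregation step such a tailer might perform is shown below. It assumes each sampled request arrives as a dictionary in the layout that `pstats.Stats(...).stats` uses, `{(file, line, function): (primitive calls, total calls, own cost, cumulative cost, callers)}`; the actual wire format and entity schema are not described in the article.

```python
from collections import Counter


def aggregate_function_costs(profiles):
    """Sum each function's own cost across many sampled request profiles.

    Own (inline) cost is used rather than cumulative cost so that a caller
    and its callees are not counted twice.
    """
    totals = Counter()
    for stats in profiles:
        for (filename, _line, func), (_cc, _nc, own_cost, _cum, _callers) in stats.items():
            totals[f"{filename}:{func}"] += own_cost
    # Most expensive functions first.
    return totals.most_common()
```

Comparing this ranking before and after a suspect commit shows which functions account for the extra instructions.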

Monitoring and Alerting Mechanism

At Instagram, we deploy our backend 30–50 times a day.

At Instagram we deploy our backend 30 to 50 times a day.

Any one of these deployments can contain troublesome CPU regressions.

Any one of these deployments can contain a troublesome CPU regression.

Since each rollout usually includes at least one diff, it is easy to identify the cause of any regression.

Because each rollout usually includes at least one diff, identifying the cause of a regression is easy.

Our efficiency monitoring mechanism includes scanning the CPU instruction in Dynostats before and after each rollout, and sending out alerts when the change exceeds a certain threshold.

Our efficiency monitoring scans the CPU instruction counts in Dynostats before and after each rollout and sends an alert when the change exceeds a certain threshold.
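The before/after check can be sketched as a relative-change test. The function name and the 2% default threshold are assumptions for illustration; the article does not state the threshold it uses.

```python
def rollout_regression(before, after, threshold=0.02):
    """Compare mean CPU instructions before and after a rollout.

    before, after: non-empty lists of per-request (or per-user) instruction
    counts. Returns the relative increase if it exceeds the threshold,
    otherwise None.
    """
    baseline = sum(before) / len(before)
    current = sum(after) / len(after)
    change = (current - baseline) / baseline
    return change if change > threshold else None
```

A caller would send an alert whenever the result is not `None`, attaching the rollout's diff so the author can be notified.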

For the CPU regressions happening over longer periods of time, we also have a detector to scan daily and weekly changes for the most heavily loaded endpoints.

For CPU regressions that build up over longer periods, we also have a detector that scans daily and weekly changes on the most heavily loaded endpoints.

Deploying new changes is not the only thing that can trigger a CPU regression.

Deploying new changes is not the only thing that can trigger a CPU regression.

In many cases, the new features or new code paths are controlled by global environment variables (GEV).

In many cases, new features or new code paths are controlled by global environment variables (GEVs).

There are very common practices for rolling out new features to a subset of users on a planned schedule.

This is a very common practice for rolling out new features to a subset of users on a planned schedule.

We added this information as extra metadata fields for each request in Dynostats and cProfile stats data.

We added this information as extra metadata fields on each request in the Dynostats and cProfile stats data.

Grouping requests by those fields reveals possible CPU regressions caused by turning the GEVs.

Grouping requests by these fields exposes possible CPU regressions caused by toggling GEVs.
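One way to picture the GEV comparison: split the sampled requests by whether a given variable was on, then compare average cost. The `gevs` field (a dict of variable name to boolean) and the function name are hypothetical; how GEV state is actually logged is not described in the article.

```python
from statistics import mean


def gev_cost_ratio(records, gev_name):
    """Ratio of mean CPU instructions with a GEV on versus off.

    Returns None when either group is empty, since no comparison is possible.
    """
    on = [r["cpu_instructions"] for r in records if r["gevs"].get(gev_name)]
    off = [r["cpu_instructions"] for r in records if not r["gevs"].get(gev_name)]
    if not on or not off:
        return None
    return mean(on) / mean(off)
```

A ratio well above 1.0 for a variable that is still limited to a small subset of users flags the cost before the feature reaches everyone.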

This enables us to catch CPU regressions before they can impact performance.

This lets us catch CPU regressions before they can affect performance.

What's Next?

Dynostats and our customized cProfile, along with the monitoring and alerting mechanism we’ve built to support them, can effectively identify the culprit for most CPU regressions.

Dynostats and our customized cProfile, together with the monitoring and alerting mechanism built to support them, can effectively identify the culprit for most CPU regressions.

These developments have helped us recover more than 50% of unnecessary CPU regressions, which would have otherwise gone unnoticed.

Thanks to these tools, we have recovered more than 50% of unnecessary CPU regressions that would otherwise have gone unnoticed.

There are still areas where we can improve and make it easier to embed into Instagram’s daily deployment flow:

There are still areas we can improve so that this fits more easily into Instagram's daily deployment flow:

The CPU instruction metric is supposed to be more stable than other metrics like CPU time, but we still observe variances that make our alerting noisy.

The CPU instruction metric is supposed to be more stable than other metrics such as CPU time, but we still see variance that makes our alerting noisy.

Keeping signal:noise ratio reasonably low is important so that developers can focus on the real regressions.

Keeping noise low relative to signal matters so that developers can focus on real regressions.

This could be improved by introducing the concept of confidence intervals and only alarm when it is high.

This could be improved by introducing confidence intervals and alarming only when confidence is high.

For different endpoints, the threshold of variation could also be set differently.

The variation threshold could also be set differently for different endpoints.
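The article names confidence intervals only as a possible improvement, so the following is one reading of the idea, not a description of anything Instagram built. It uses a normal-approximation interval for the difference in means (z = 1.96, roughly 95%) and alarms only when even the lower bound of the increase clears the endpoint's threshold. The per-endpoint threshold table and its values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical per-endpoint thresholds on the relative increase.
ENDPOINT_THRESHOLDS = {"feed": 0.01, "profile": 0.05}
DEFAULT_THRESHOLD = 0.02


def confident_regression(endpoint, before, after, z=1.96):
    """Alarm only if the lower confidence bound of the increase exceeds the threshold.

    before, after: lists of at least two instruction-count samples each.
    """
    diff = mean(after) - mean(before)
    # Standard error of the difference between two independent sample means.
    stderr = sqrt(variance(before) / len(before) + variance(after) / len(after))
    lower_bound = diff - z * stderr
    threshold = ENDPOINT_THRESHOLDS.get(endpoint, DEFAULT_THRESHOLD)
    return lower_bound / mean(before) > threshold
```

With tight samples a 10% increase triggers the alarm, while the same average increase buried in wide variance does not, which is the noise reduction the passage is after.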

One limitation for detecting CPU regressions by GEV change is that we have to manually enable the logging of those comparisons in Dynostats.

One limitation of detecting CPU regressions caused by GEV changes is that we have to enable the logging of those comparisons in Dynostats by hand.

As the number of GEVs increases and more features are developed, this won't scale well.

As GEVs multiply and more features are built, this will not scale well.

Instead, we could leverage an automatic framework that schedules the logging of these comparisons and iterates through all GEVs, and send alerts when regressions are detected.

Instead, we could use an automated framework that schedules the logging of these comparisons, iterates through all GEVs, and sends alerts when regressions are detected.

cProfile needs some enhancement to handle wrapper functions and their children functions better.

cProfile needs some enhancements to handle wrapper functions and their child functions better.

With the work we’ve put into building the efficiency framework for Instagram’s web service, we are confident that we will keep scaling our service infrastructure using Python.

Given the work we have put into building the efficiency framework for Instagram's web service, we are confident that we can keep scaling our service infrastructure on Python.

We’ve also started to invest more into the Python language itself, and are beginning to explore moving our Python from version 2 to 3.

We have also begun investing more in the Python language itself and are starting to explore a move from Python 2 to Python 3.

We will continue to explore this and more experiments to keep improving both infrastructure and developer efficiency, and look forward to sharing more soon.

We will keep exploring this and other experiments to improve both infrastructure and developer efficiency, and we look forward to sharing more soon.

Min Ni is a software engineer at Instagram.

Min Ni works at Instagram as a software engineer.

Afterword

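What on earth are "CPU regression" and "confidence intervals"? I really wish I had studied English harder... That is my lament, but I did more or less follow the efficiency story this article tells. I have never used Instagram, so I do not have much feel for it, but to handle traffic from 500 million users (wow), they built a Python tool (Dynostats) that reads CPU instruction counts to track down what causes CPU regressions, and by introducing it they recovered more than 50% of those regressions. Once you have to account for the CPU itself when working on efficiency, the line between software knowledge and hardware knowledge gets blurry, and that makes my head hurt. To provide large, safe, and solid services from here on, I wonder whether the trend will be to dig into lower and lower layers, as in the earlier story about tweaking the HTTP/2 protocol to speed up page loads. If so, I suspect the amount of knowledge each engineer needs will just keep growing. Still, is there any large-scale service in Japan right now that goes as far as the CPU in its efficiency work...?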