khuuyj@: April 2013

Blender Net Renderの現状の問題と対策

→いつものひどい英語版/Terrible English version as always

BlenderのNetRender add-onはユーザー設定画面で見て判る通り、ベータ版となっているが、現時点では以下の問題があると思われる

(1)大きなファイル、もしくは大量の外部ファイルを含むデータのレンダリングではクライアントがエラーを起こしやすい
(2)フレーム当たりのレンダリングが長時間かかる場合はエラーになる
(3)Force upload all filesを指定してもリンクデータの参照に失敗する

(3)の対策は、将来対応されることを期待しつつ、全てのスレイブサーバーにクライアントと同じパスで各データを参照出来るように設定しておくことで対処するとして、(1)(2)は実はタイムアウトの判定の方法に問題がある。

(2)に関しては、実際には通信用モジュールがタイムアウトを起こしているのではなく、応答間隔が長かった場合、つまり前回の通信から今回の通信までの間隔が一定以上であればエラーを発生させている、もしくは該当スレイブをジョブの対象リストから除外すると言う作業を行っている。
つまり、通信が出来ないためにタイムアウトを起こしている訳ではなく、処理の遅いサーバーをメンバーから外すと言う処理を行っている。
但し、このタイムアウトにする閾値が固定で与えられているため、例えば、閾値が30分であれば、1フレームの処理に最速のサーバーで1時間掛かる場合には100％失敗となる。

現時点、この閾値は5分になっている。
CGWORLD誌でトランスフォーマーのレンダリング時間が（最長？）1フレーム70時間と紹介されていたようだが、勿論、お話にならない。

そして、この対策だが、マスターのソースコードのタイムアウト時間を与えている箇所を修正する。
実際のソースコード上でもコメントで、UIで変更出来るようにしなければならないと言う旨が書かれている。
但し、この対策法、問題が無い訳ではない。
ジョブをキャンセルした場合も、次のジョブを受け入れるまでこの時間待つようになる。
つまり、レンダリング、1フレーム当たり1時間のジョブを実行出来るようにすると、ジョブをキャンセルした場合に次のジョブを受け入れるまでに1時間掛かる。

生存通知の有無によるタイムアウトと、処理が遅いことによるタイムアウトを区別していないために起こる仕様上の問題だが、ここを変更して対応時間を伸ばすと、ジョブをキャンセルして新しいジョブを投入する場合には手動でスレイブを再起動する必要がある。

クライアントのタイムアウトも固定で5秒タイムアウトするようになっている。
現状パイソンはマルチコアに対応していないし、コード自体も並行処理で生存通知を行うようになっていないので、データサイズが大きかったり回線が遅くて転送時間が長くなると、タイムアウトする。
クライアントがタイムアウトすると、

(1)データの転送が正常に終了したか判らない
(2)マスターからレンダリング済データをダウンロード出来ない

の2点が問題となるが、LAN内やFTTH環境下ではほぼ起こらない上、(2)はウェブUIかクライアントのボタンからダウンロード出来るので、問題としてはさほど大きく無い。

Current problems with Blender NetRender and workarounds

→Python Code

Blender's NetRender add-on is marked as a beta version, as you can see in the User Preferences screen, and at present I think it has the following problems.

(1) The client is prone to errors when rendering data that contains a large file or a large number of external files.
(2) Rendering fails when a single frame takes a long time to render.
(3) References to linked data fail even when "Force upload all files" is enabled.

As for (3), while hoping it will be fixed in a future release, the workaround is to set up every slave server so that it can reference each data file at the same path as the client.
Problems (1) and (2), on the other hand, actually come from the way timeouts are decided.

In the case of (2), the communication module is not actually timing out. Instead, when the interval between responses is long, that is, when the time from the previous communication to the current one exceeds a certain length, the master raises an error or removes that slave from the job's target list. In other words, the timeout does not happen because communication is impossible; the master is dropping slow servers from the pool.
However, the threshold for this timeout is fixed in the current code. For example, if the threshold were 30 minutes, a job whose frames take 1 hour each even on the fastest server would fail 100% of the time.
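
The behavior described above can be pictured with a minimal sketch. It is not NetRender's actual code: `SLAVE_TIMEOUT_MIN`, `Slave`, and `prune_slaves` are hypothetical names, and the sketch only assumes that the master records when each slave last reported in and compares the gap against a fixed threshold.

```python
from dataclasses import dataclass

SLAVE_TIMEOUT_MIN = 30  # hypothetical fixed threshold, in minutes


@dataclass
class Slave:
    name: str
    last_seen_min: float  # time of the slave's last report, in minutes


def prune_slaves(slaves, now_min, timeout_min=SLAVE_TIMEOUT_MIN):
    """Return only the slaves that reported within the last timeout_min minutes."""
    return [s for s in slaves if now_min - s.last_seen_min <= timeout_min]


# "busy" started a 60-minute frame at minute 0 and has not reported since;
# "recent" reported at minute 40. At minute 45 only "recent" survives.
slaves = [Slave("busy", last_seen_min=0.0), Slave("recent", last_seen_min=40.0)]
survivors = prune_slaves(slaves, now_min=45.0)
print([s.name for s in survivors])  # ['recent']
```

Under these assumptions the busy slave is dropped at minute 45 although it is working normally, which is the failure mode described for long frames.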

In the current code, this threshold is set to 5 minutes.
CGWORLD (a magazine in Japan) apparently ran an article about the Transformers TV animation saying that rendering took 70 hours per frame (the longest case?); needless to say, the current Blender NetRender could not handle anything like that.

To work around this, you have to edit the place in the master's Python source code where the timeout is set. (A comment in the source itself says that this needs to be made adjustable from the UI, but that has not been done yet.)

However, this workaround is not free of problems.
If you lengthen the timeout, the wait before the next job is accepted gets longer as well.
So if you set the timeout to 1 hour, NetRender can process a job that needs 1 hour per frame, but if you cancel that job, the slaves will not accept the next job for 1 hour.

This is a flaw in the design: the timeout for a missing keep-alive signal and the timeout for a slow machine are not distinguished. As a result, once the timeout has been extended, we must restart the slaves by hand whenever we cancel a job and submit a new one (or wait out the timeout).
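
One conceivable way to separate the two cases is sketched below. This is not how NetRender works today; `HEARTBEAT_TIMEOUT_S`, `FRAME_TIMEOUT_S`, `SlaveState`, and `classify` are hypothetical names, and the sketch assumes the slave could send some sign of life independently of finishing a frame.

```python
from dataclasses import dataclass

HEARTBEAT_TIMEOUT_S = 60       # hypothetical: slave is unreachable after this much silence
FRAME_TIMEOUT_S = 2 * 60 * 60  # hypothetical: a frame is considered too slow after this long


@dataclass
class SlaveState:
    last_heartbeat_s: float  # when the slave last sent any sign of life
    frame_started_s: float   # when the slave started its current frame


def classify(state, now_s):
    """Return 'unreachable', 'too_slow', or 'ok' for one slave."""
    if now_s - state.last_heartbeat_s > HEARTBEAT_TIMEOUT_S:
        return "unreachable"
    if now_s - state.frame_started_s > FRAME_TIMEOUT_S:
        return "too_slow"
    return "ok"


# One hour into a frame, with a heartbeat 10 seconds ago: still fine
print(classify(SlaveState(last_heartbeat_s=3590, frame_started_s=0), now_s=3600))  # ok
```

With two thresholds like these, the short one could govern how quickly a cancelled or dead slave is noticed, while the long one could be raised for heavy scenes without affecting that.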

On the client side, the timeout is also fixed, at 5 seconds.
Python currently does not support multi-core, and the code itself does not send a keep-alive signal in parallel with the transfer, so the client times out when the transfer takes a long time because the data is large or the network is slow.
When the client times out, two problems arise.

(1) We cannot tell whether the transfer finished successfully.
(2) The client cannot download the rendered results from the master automatically.

The timeout rarely occurs on a LAN or over FTTH, and (2) is not very serious anyway, because we can download the results from the master through the web UI or with the client's button.
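
Threads in Python can still overlap time spent waiting on I/O, so one conceivable approach is to send a keep-alive signal from a background thread while the upload runs. The sketch below only illustrates that idea with the standard library; `transfer_with_keepalive`, `send_chunks`, and `send_keepalive` are hypothetical names, and whether NetRender's protocol could accept such a signal during a transfer is an assumption.

```python
import threading
import time


def transfer_with_keepalive(send_chunks, send_keepalive, interval_s=1.0):
    """Run send_chunks() while calling send_keepalive() every interval_s seconds."""
    done = threading.Event()

    def beat():
        # Event.wait returns False on timeout and True once the transfer has finished
        while not done.wait(interval_s):
            send_keepalive()

    worker = threading.Thread(target=beat, daemon=True)
    worker.start()
    try:
        return send_chunks()
    finally:
        done.set()
        worker.join()


beats = []
result = transfer_with_keepalive(
    send_chunks=lambda: time.sleep(0.5) or "uploaded",  # stand-in for a slow upload
    send_keepalive=lambda: beats.append(time.monotonic()),
    interval_s=0.1,
)
print(result, len(beats) >= 1)
```

The keep-alive thread stops as soon as the transfer returns or raises, so it does not outlive the upload.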

マスターの待機時間を1時間、クライアントの待機時間を2分に延長する
Extend the master's wait time to 1 hour and the client's to 2 minutes

Directory: ./2.66/scripts/addons/netrender

File: master.py, line 984
Before: self.slave_timeout = 5 # 5 mins: need a parameter for that
After:  self.slave_timeout = 60 # 60 mins: need a parameter for that

File: utils.py, line 162
Before: def clientConnection(netsettings, report = None, scan = True, timeout = 5):
After:  def clientConnection(netsettings, report = None, scan = True, timeout = 120):

※1フレームに30分を要する場合、chunk=5だとマスターの待機時間は150分以上必要になる
If one frame takes 30 minutes and the chunk size is 5, the master's wait time needs to be more than 150 minutes.
※クライアントの待機時間は一見初期値が5秒だが、これを変更するUIは存在しないので実質固定値
At first glance the client's wait time of 5 seconds looks like a mere default value, but there is no UI for changing it, so it is effectively a fixed value.
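
The chunk arithmetic in the first note can be written out as a small helper. `required_master_wait_min` is a hypothetical name, and the sketch assumes, as the note implies, that a slave does not report back until it has finished every frame in its chunk.

```python
def required_master_wait_min(minutes_per_frame, chunk_size):
    """Minutes a slave stays silent while rendering one whole chunk of frames."""
    return minutes_per_frame * chunk_size


print(required_master_wait_min(30, 5))  # 150
```

The value set in master.py would need to be larger than this result, with some margin for slower machines.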

コードを見た感じ、複数ソースに跨っているためざっくりと見ただけだが、スレイブのタイムアウトはWebUIの当該項目が更新されるタイミング＝スレイブの場合はJobIDが変更されるタイミングで行っているっぽい。このため一覧からステータスの変わらないスレイブが外れやすく、ジョブが実行中にも関わらずエラーになるっぽい。
と言うことはウェブ画面を出さなければ、マスターがスレイブを除外する処理が走らない？

I only skimmed the code, since it is spread across several source files, but the slave timeout seems to be checked when the corresponding entry in the web UI is updated, which for a slave appears to be when its job ID changes. Because of this, slaves whose status does not change tend to be dropped from the list, and the job seems to end in an error even though it is still running.
Does that mean the master's process for removing slaves never runs if I don't open the web UI?

ウェブ画面無しでもタイムアウトするタイミングは同じでした。残念。
The timeout occurred at the same point even without the web UI. Too bad.

khuuyj@

Friday, April 19, 2013

Blender Net Render
