{"id":153,"date":"2022-05-08T18:00:37","date_gmt":"2022-05-08T16:00:37","guid":{"rendered":"https:\/\/mczerwonka.pl\/dev\/?p=153"},"modified":"2022-05-09T19:09:08","modified_gmt":"2022-05-09T17:09:08","slug":"why-avoid-sprintf","status":"publish","type":"post","link":"http:\/\/mczerwonka.pl\/dev\/why-avoid-sprintf\/","title":{"rendered":"Why avoid sprintf?"},"content":{"rendered":"\n<p>I was going to write about avoiding dynamic memory allocations during parallel execution, but when I was profiling a test code I found a bug in <code>_free_dbg<\/code> which analysis consumed the time I had for this post, so I sidetracked a bit and I guess the initial topic will have to wait. <\/p>\n\n\n\n<p>But basically, dynamic allocations is not something you should do in your game code if performance matters, especially during parallel execution. If the above seems obvious this post probably is not for you. <\/p>\n\n\n\n<p>The other day I wanted to parallelize some serialization code, but the results in debug on Windows were mediocre at most. After some time I figured out that the main offender was <code>_free_dbg<\/code>, even though I preallocated all the memory before the serialization. But lets start from the beginning.<\/p>\n\n\n\n<p>Imagine I have a class Foo, which represents my game object:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-cpp\" data-lang=\"C++\"><code>struct Foo\n{\n    bool _Boolean;\n    float _FloatingPoint;\n    \/\/ ...\n};<\/code><\/pre><\/div>\n\n\n\n<p>I want to serialize those objects to a text buffer. Example implementation could look like this:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-cpp\" data-lang=\"C++\"><code>int Foo::SerializeAsText(char* Buffer) const\n{\n    int chars = sprintf(Buffer, &quot;%s&quot;, _Boolean ? &quot;y&quot; : &quot;n&quot;);\n    chars += sprintf(Buffer + chars, &quot;%.2f&quot;, _FloatingPoint);\n    return chars;\n}<\/code><\/pre><\/div>\n\n\n\n<p>Now I have my serialization in place and I want to profile it. To produce some reliable profiling data I need lots of objects, so I organized them into 1024 pages, every page containing 1024 objects.<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-cpp\" data-lang=\"C++\"><code>template&lt;typename T, int PageCount = 1024, int PageSize = 1024&gt;\nstruct Storage\n{\n    std::vector&lt;std::vector&lt;T&gt;&gt; _Pages;\n    \/\/ ...\n}<\/code><\/pre><\/div>\n\n\n\n<p>This setup produces  1024*1024=1048576  objects, which should be enough for profiling.<\/p>\n\n\n\n<p>Lets start from page by page serialization.<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-cpp\" data-lang=\"C++\"><code>template&lt;typename T, int PageCount = 1024, int PageSize = 1024&gt;\nstruct Storage\n{\n    template&lt;typename int (T::* Serializer)(char*)&gt;\n    void Serialize(char* Buffer)\n    {\n        for (const auto& Page : _Pages)\n            for (const auto& Object : Page)\n                Buffer += (Object.*Serializer)(Buffer);\n    }\n}<\/code><\/pre><\/div>\n\n\n\n<p>Page by page serialization takes 1015 milliseconds.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full\"><a href=\"https:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-2.png\"><img loading=\"lazy\" width=\"698\" height=\"117\" src=\"https:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-2.png\" alt=\"\" class=\"wp-image-161\" srcset=\"http:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-2.png 698w, http:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-2-300x50.png 300w\" sizes=\"(max-width: 698px) 100vw, 698px\" \/><\/a><figcaption>Page by page serialization<\/figcaption><\/figure><\/div>\n\n\n\n<p>It is going to be my baseline. <\/p>\n\n\n\n<p>Lets do the same, but with <code>parallel_for<\/code> iterating over pages of objects:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-cpp\" data-lang=\"C++\"><code>template&lt;typename T, int PageCount = 1024, int PageSize = 1024&gt;\nstruct Storage\n{\n    void SerializeParallel(char* Buffer)\n    {\n        std::for_each(std::execution::par_unseq, _Pages.data(), _Pages.data() + PageCount, [&](const std::vector&lt;T&gt;& Page)\n            {\n                const int Index = &Page - _Pages.data();\n                char* LocalBuffer = Buffer + (Index * PageSize * T::SerializedSize);\n                for (const auto& Object : Page)\n                    LocalBuffer += (Object.*Serializer)(LocalBuffer);\n            });\n    }\n}<\/code><\/pre><\/div>\n\n\n\n<p>As you can see there&#8217;s no dynamic memory allocation happening. The buffer is preallocated before calling <code>SerializeParallel<\/code>.<\/p>\n\n\n\n<p>Serializing pages in parallel takes 561 milliseconds.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full\"><a href=\"https:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-3.png\"><img loading=\"lazy\" width=\"539\" height=\"150\" src=\"https:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-3.png\" alt=\"\" class=\"wp-image-164\" srcset=\"http:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-3.png 539w, http:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-3-300x83.png 300w\" sizes=\"(max-width: 539px) 100vw, 539px\" \/><\/a><figcaption>Parallel serialization<\/figcaption><\/figure><\/div>\n\n\n\n<p>Better than the baseline, but given I&#8217;m running this on Intel i7-8750H with 12 logical cores available, it&#8217;s far from what I expected. I&#8217;m going to take a closer look at what the program is actually doing when serializing in parallel.<\/p>\n\n\n\n<p>Starting from the thread utilization:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full\"><a href=\"https:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-4.png\"><img loading=\"lazy\" width=\"525\" height=\"430\" src=\"https:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-4.png\" alt=\"\" class=\"wp-image-165\" srcset=\"http:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-4.png 525w, http:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-4-300x246.png 300w\" sizes=\"(max-width: 525px) 100vw, 525px\" \/><\/a><figcaption>Parallel serialization threads<\/figcaption><\/figure><\/div>\n\n\n\n<p>This doesn&#8217;t look good. There are lots of bubbles, where threads seem to idle or wait. I&#8217;m diving into call stacks to see where the bubbles come from just to find this:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full\"><a href=\"https:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-6.png\"><img loading=\"lazy\" width=\"608\" height=\"326\" src=\"https:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-6.png\" alt=\"\" class=\"wp-image-188\" srcset=\"http:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-6.png 608w, http:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-6-300x161.png 300w\" sizes=\"(max-width: 608px) 100vw, 608px\" \/><\/a><figcaption>Call to <code>_free_dbg<\/code> coming out of <code>sprintf<\/code> implementation<\/figcaption><\/figure><\/div>\n\n\n\n<p>It looks that <code>sprintf<\/code> under the hood calls a <code>_free_dbg<\/code> which in turn triggers synchronization due to a lock. In the above screenshot the lock consumed 15 milliseconds! How many of those does the code hit? <\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><a href=\"https:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-7.png\"><img loading=\"lazy\" width=\"1024\" height=\"211\" src=\"https:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-7-1024x211.png\" alt=\"\" class=\"wp-image-189\" srcset=\"http:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-7-1024x211.png 1024w, http:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-7-300x62.png 300w, http:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-7-768x159.png 768w, http:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-7.png 1521w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption>Locks in parallel serialization<\/figcaption><\/figure><\/div>\n\n\n\n<p>A lot. Lets look at the numbers:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full\"><a href=\"https:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-5.png\"><img loading=\"lazy\" width=\"978\" height=\"266\" src=\"https:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-5.png\" alt=\"\" class=\"wp-image-167\" srcset=\"http:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-5.png 978w, http:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-5-300x82.png 300w, http:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-5-768x209.png 768w\" sizes=\"(max-width: 978px) 100vw, 978px\" \/><\/a><figcaption>Time breakdown<\/figcaption><\/figure><\/div>\n\n\n\n<p>It seems that the serialization code spends 336 ms in locks (<code>__acrt_lock<\/code>). It is 60% time of <code>Foo::SerializeAsText<\/code>. Those numbers are crazy. <\/p>\n\n\n\n<p>Calls to <code>_free_dbg<\/code><em> <\/em>mean that <code>sprintf<\/code><em> <\/em>probably dynamically allocates memory in some cases. I don&#8217;t think it does in this situation, because there are no corresponding calls to <code>malloc<\/code>. So why does it even show in the profiler? I guess because the caller assumes that calling <code>free<\/code> with a <code>nullptr<\/code> is a no-op (does nothing). But is it?<\/p>\n\n\n\n<p>Let&#8217;s look into the Windows SDK code (10.0.19041.0):<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-cpp\" data-lang=\"C++\"><code>\/\/ free.cpp\nextern &quot;C&quot; _CRT_HYBRIDPATCHABLE __declspec(noinline) void __cdecl free(void* const block)\n{\n    #ifdef _DEBUG\n    _free_dbg(block, _NORMAL_BLOCK);\n    #else\n    _free_base(block);\n    #endif\n}\n\n\/\/ debug_heap.cpp\nextern &quot;C&quot; __declspec(noinline) void __cdecl _free_dbg(void* const block, int const block_use)\n{\n    __acrt_lock(__acrt_heap_lock);\n    __try\n    {\n        int const actual_use{block_use == _UNKNOWN_BLOCK && block != nullptr\n            ? header_from_block(block)-&gt;_block_use\n            : block_use};\n\n        free_dbg_nolock(block, actual_use);\n    }\n    __finally\n    {\n        __acrt_unlock(__acrt_heap_lock);\n    }\n}\n\n\/\/ free_base.cpp\nextern &quot;C&quot; void __declspec(noinline) __cdecl _free_base(void* const block)\n{\n    if (block == nullptr)\n    {\n        return;\n    }\n\n    if (!HeapFree(select_heap(block), 0, block))\n    {\n        errno = __acrt_errno_from_os_error(GetLastError());\n    }\n}<\/code><\/pre><\/div>\n\n\n\n<p>Calling <code>free(0)<\/code> in release actually is a no-op, but in debug it evaluates to a <code>_free_dbg<\/code> which always locks, regardless of the argument. I&#8217;ve found a <a rel=\"noreferrer noopener\" href=\"https:\/\/developercommunity.visualstudio.com\/t\/-free-dbg-implemented\/929402#T-N937578\" target=\"_blank\">thread<\/a> on Visual Studio Developer Community, where someone raised it as an issue two years ago. Hopefully it&#8217;ll be fixed one day.<\/p>\n\n\n\n<hr class=\"wp-block-separator\"\/>\n\n\n\n<p>Getting back to the serialization &#8211; I am not going to fix <code>sprintf<\/code> so that it doesn&#8217;t allocate memory, but there are bunch of alternatives out there. I&#8217;ve found <a href=\"https:\/\/github.com\/charlesnicholson\/nanoprintf\">nanoprintf<\/a> from Charles Nicholson to be a good replacement. It&#8217;s a single header, lightweight implementation and does not allocate memory. <\/p>\n\n\n\n<p>After replacing <code>sprintf<\/code> with <code>npf_snprintf<\/code> the profiler produces this output:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-8.png\"><img loading=\"lazy\" width=\"924\" height=\"146\" src=\"https:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-8.png\" alt=\"\" class=\"wp-image-193\" srcset=\"http:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-8.png 924w, http:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-8-300x47.png 300w, http:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-8-768x121.png 768w\" sizes=\"(max-width: 924px) 100vw, 924px\" \/><\/a><figcaption>Parallel serialize with nanoprintf instead of sprintf<\/figcaption><\/figure>\n\n\n\n<p>Thread utilization also looks nicer:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-10.png\"><img loading=\"lazy\" width=\"769\" height=\"336\" src=\"https:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-10.png\" alt=\"\" class=\"wp-image-195\" srcset=\"http:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-10.png 769w, http:\/\/mczerwonka.pl\/dev\/wp-content\/uploads\/2022\/05\/image-10-300x131.png 300w\" sizes=\"(max-width: 769px) 100vw, 769px\" \/><\/a><figcaption>Much better thread utilization with nanoprintf<\/figcaption><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>To sum this up &#8211; dynamic allocations in parallel execution are bad for performance as they lead to thread synchronization. You should strive to use preallocated memory and even then profile your code regularly to spot potential locks coming out of external libraries. <\/p>\n\n\n\n<p>You should also be careful about overusing printf-family functions in parallel code, as they might allocate memory and will always make a call to <code>free<\/code> (even with a <code>nullptr<\/code>) which currently triggers a lock on Windows when compiling in debug configuration. <\/p>\n","protected":false},"excerpt":{"rendered":"<p>I was going to write about avoiding dynamic memory allocations during parallel execution, but when I was profiling a test code I found a bug in _free_dbg which analysis consumed the time I had for this post, so I sidetracked a bit and I guess the initial topic will have to wait. But basically, dynamic allocations is not something you should do in your game code if performance matters, especially during parallel execution. If the above seems obvious this post probably is not for you. The other day I wanted to parallelize some serialization code, but the results in debug\u2026<\/p>\n","protected":false},"author":1,"featured_media":189,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[23],"tags":[16,21,17,19,18,22,20],"blocksy_meta":{"styles_descriptor":{"styles":{"desktop":"","tablet":"","mobile":""},"google_fonts":[],"version":5}},"_links":{"self":[{"href":"http:\/\/mczerwonka.pl\/dev\/wp-json\/wp\/v2\/posts\/153"}],"collection":[{"href":"http:\/\/mczerwonka.pl\/dev\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/mczerwonka.pl\/dev\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/mczerwonka.pl\/dev\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/mczerwonka.pl\/dev\/wp-json\/wp\/v2\/comments?post=153"}],"version-history":[{"count":41,"href":"http:\/\/mczerwonka.pl\/dev\/wp-json\/wp\/v2\/posts\/153\/revisions"}],"predecessor-version":[{"id":212,"href":"http:\/\/mczerwonka.pl\/dev\/wp-json\/wp\/v2\/posts\/153\/revisions\/212"}],"wp:featuredmedia":[{"embeddable":true,"href":"http:\/\/mczerwonka.pl\/dev\/wp-json\/wp\/v2\/media\/189"}],"wp:attachment":[{"href":"http:\/\/mczerwonka.pl\/dev\/wp-json\/wp\/v2\/media?parent=153"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/mczerwonka.pl\/dev\/wp-json\/wp\/v2\/categories?post=153"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/mczerwonka.pl\/dev\/wp-json\/wp\/v2\/tags?post=153"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}