CoreCLR's environment is not your environment

CoreCLR maintains its own private copy of environment variables

This came up when I was helping a colleague at a previous job (where I still wrote C# code probably 50% of the time) diagnose a library load failure inside a Linux container. Internally there is a library that loads different implementations (a mock implementation and a real implementation) of another library based on an environment variable USE_MAGIC_TEST_LIB, and the .NET code calling that library calls SetEnvironmentVariable to set it conditionally as part of a testing framework:

if (useTestFramework)
{
    Environment.SetEnvironmentVariable("USE_MAGIC_TEST_LIB", "1");
}

NativeCode.CallToSomeNativeLibraryThatLoadsDifferentLibraryDependsOnEnv();

This looks reasonable, except that it didn’t work at all. It loaded the wrong library, and things quickly went downhill after that.

We were scratching our heads for a while until we decided to add a trace to see whether the environment variable was actually taking effect in the native code. Interestingly, the native code didn’t see it at all. It’s as if they didn’t know about each other’s environment!

Actually, that observation is more or less what’s going on. Before we dig a bit deeper, here is a bit of history of the CoreCLR cross-platform implementation. Not surprisingly, .NET started out Windows-centric and all the OS calls were strictly Windows APIs. When folks decided to port it to Linux/Mac as part of Rotor (later Silverlight), there were two options:

  1. Design a new platform abstraction from scratch and move everything over to it
  2. Keep the API design aligned with Windows and implement the Windows APIs on top of Linux APIs

Option 2 is obviously the cheaper solution, and it has the advantage that the Windows code stays untouched and therefore won’t regress, which is super important. The caveat is that implementing Windows APIs using Linux APIs can get tricky, but that is a risk people were willing to take. So the new PAL layer was introduced, with “new” APIs that look exactly like Windows APIs but are implemented using Linux APIs.

In the case of SetEnvironmentVariable, it is implemented in PAL/environ.cpp:

BOOL
PALAPI
SetEnvironmentVariableA(
            IN LPCSTR lpName,
            IN LPCSTR lpValue)
{
    // ...

    // All the conditions are met. Set the variable.
    int iLen = strlen(lpName) + strlen(lpValue) + 2;
    LPSTR string = (LPSTR) PAL_malloc(iLen);
    if (string == nullptr)
    {
        bRet = FALSE;
        ERROR("Unable to allocate memory\n");
        SetLastError(ERROR_NOT_ENOUGH_MEMORY);
        goto done;
    }

    sprintf_s(string, iLen, "%s=%s", lpName, lpValue);
    nResult = EnvironPutenv(string, FALSE) ? 0 : -1;

    PAL_free(string);
    string = nullptr;

    // If EnvironPutenv returns FALSE, it almost certainly failed to allocate memory.
    if (nResult == -1)
    {
        bRet = FALSE;
        ERROR("Unable to allocate memory\n");
        SetLastError(ERROR_NOT_ENOUGH_MEMORY);
        goto done;
    }

This looks a bit fishy. It’s allocating its own buffer and calling into EnvironPutenv, which basically does this:

    for (i = 0; palEnvironment[i] != nullptr; i++)
    {
        const char *existingEquals = strchr(palEnvironment[i], '=');
        if (existingEquals - palEnvironment[i] == nameLength)
        {
            if (memcmp(entry, palEnvironment[i], nameLength) == 0)
            {
                free(palEnvironment[i]);
                palEnvironment[i] = copy;

                result = TRUE;
                break;
            }
        }
    }

    if (palEnvironment[i] == nullptr)
    {
        _ASSERTE(i < palEnvironmentCapacity);
        if (i == (palEnvironmentCapacity - 1))
        {
            // We found the first null, but it's the last element in our environment
            // block. We need more space in our environment, so let's double its size.
            int resizeRet = ResizeEnvironment(palEnvironmentCapacity * 2);
            if (resizeRet != TRUE)
            {
                free(copy);
                goto done;
            }
        }

        _ASSERTE(copy != nullptr);
        palEnvironment[i] = copy;
        palEnvironment[i + 1] = nullptr;
        palEnvironmentCount++;

        result = TRUE;
    }

So it’s basically managing its own environment in its own palEnvironment array! No wonder things don’t work.

But why go through all this trouble when Linux has its own getenv/setenv?

These posts provide the answer:

http://rachelbythebay.com/w/2017/01/30/env/

Modifications of environment variables are not allowed in multi-threaded programs. – the glibc manual

https://github.com/dotnet/coreclr/issues/635

From looking at the code, I suspect that the cached environment was attempt to fix thread safety or consistency problems between Environment.GetEnvironmentVariables and Environment.SetEnvironmentVariable.

The enumeration of the environment starts by reading the environ global variable, without any locks. Consider what may happen if somebody calls setenv while the enumeration is in progress.

It’s because setenv/getenv isn’t particularly thread safe - you can crash when reading the environment while it is being modified by another thread, and two threads modifying the environment at the same time can leak memory.
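
To make the hazard concrete, here is a small C++ sketch (my own illustration, not CoreCLR code) of the kind of race described above: one thread keeps adding environment variables with setenv while another walks environ. It may or may not crash on any given run, which is exactly what makes these bugs so nasty.

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <string>
#include <thread>

extern char **environ;   // the process environment block

int main()
{
    // Writer: keep adding new variables; this forces the C library to grow
    // (and potentially reallocate) the environ array.
    std::thread writer([] {
        for (int i = 0; i < 10000; i++)
        {
            std::string name = "RACY_VAR_" + std::to_string(i);
            setenv(name.c_str(), "x", 1 /* overwrite */);
        }
    });

    // Reader: walk environ with no synchronization. If the writer reallocates
    // the array or frees an entry while we still hold a pointer into it,
    // this is a use-after-free waiting to happen.
    std::thread reader([] {
        for (int i = 0; i < 10000; i++)
        {
            for (char **e = environ; *e != nullptr; e++)
            {
                volatile std::size_t len = std::strlen(*e);
                (void)len;
            }
        }
    });

    writer.join();
    reader.join();
    std::printf("Survived this run - by luck, not by design.\n");
    return 0;
}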

In this case, one can see a few options:

  1. Do nothing - the issues are Linux-specific, and you should take the same care when calling these functions from .NET as you would when calling them in native Linux code.
  2. Throw PlatformNotSupportedException - getenv/setenv just isn’t safe.
  3. Add a critical section around getenv/setenv - make them safe to call from multiple threads.
  4. Implement your own safe environment helpers - with the consequence that the rest of the native code won’t observe the changes through getenv/setenv.

Option 1 isn’t really acceptable because .NET developers need their code to be portable - they don’t want to handle platform-specific oddities just to make their code portable. They expect the .NET platform library to be safe and reliable.

Option 2 isn’t great either, for the same reason, and it would also break a ton of code when ported to Linux.

Option 3 makes the .NET code safe, but it wouldn’t protect against native code racing with getenv/setenv calls made from within .NET code, so race conditions could still occur and the .NET developer would have little control over them.

Option 4 is safe, but it can lead to subtle breaking changes.

Unfortunately there isn’t a great option here - every one of them has its downsides. At the end of the day, it comes down to compatibility/portability versus surprising behavior, and the .NET team favors compatibility and portability. While that can lead to subtle breaking changes, the breakage is at least consistent and therefore easier to diagnose. In our case, being inside a container launched by another system made the whole thing much harder to diagnose, but that’s more a problem of the container launcher itself. Even though we were bitten by this very problem, I agree it is most likely the better choice.

For more information, you can refer to CoreCLR code here:

https://github.com/dotnet/coreclr/blob/master/src/pal/src/misc/environ.cpp#L607

And here is a simple code snippet to demonstrate the issue. I’ve tested this on my MBP.

using System;
using System.Runtime.InteropServices;

namespace set_env
{
    class Program
    {
        [DllImport("/usr/lib/system/libsystem_c.dylib")]
        static extern IntPtr getenv(string name);

        // Note: the real C signature is int setenv(const char *name, const char *value, int overwrite),
        // so the third parameter is needed here.
        [DllImport("/usr/lib/system/libsystem_c.dylib")]
        static extern int setenv(string name, string value, int overwrite);

        static void Main(string[] args)
        {
            string envName = "MY_ENV";

            Console.WriteLine("MY_ENV={0}", Environment.GetEnvironmentVariable(envName));

            Console.WriteLine("Setting it to ~/path using Environment.SetEnvironmentVariable");
            Environment.SetEnvironmentVariable(envName, "~/path");

            Console.WriteLine("MY_ENV={0}", Environment.GetEnvironmentVariable(envName));

            // The native side does not see the value set through Environment.SetEnvironmentVariable.
            IntPtr env = getenv(envName);
            string envStr = Marshal.PtrToStringAnsi(env);
            Console.WriteLine("getenv(MY_ENV)={0}", envStr);

            Console.WriteLine("Setting it using setenv");
            setenv(envName, "~/path", 1);

            env = getenv(envName);
            envStr = Marshal.PtrToStringAnsi(env);
            Console.WriteLine("getenv(MY_ENV)={0}", envStr);
        }
    }
}

Now, if you are interested in digging a bit more, here are some bonus questions for your consideration:

  1. Does the same problem happen on Windows? Why, or why not?
  2. Why does the code above use Marshal.PtrToStringAnsi instead of just having getenv return a string?

That’s all for now. Thanks for reading!

Sorting structured data in an unstructured way using memcmp-friendly encoding, Part 1 - sorting integers and strings

Sorting structured data in an unstructured way using memcmp-friendly encoding, Part 1

In many interesting data storage applications, one often needs to sort the data in order to search it efficiently - it is such a fundamental operation. The data is often structured, like in a database (or in anything that uses a database under the hood) - the application knows exactly what the data is: for example, a record/row with an integer field, a string field, a datetime field, and so on. In those cases you can easily sort the data by interpreting it as what it is and comparing the fields one by one. This is usually achieved by having a base class with virtual functions and several derived classes that implement the comparison as well as determine the length to advance to the next field:

#include <cstdint>

class Data {
 public:
  virtual ~Data() = default;
  virtual int Compare(void *left, void *right) = 0;
  virtual int GetLength(void *data) = 0;
};

class UInt32_Data : public Data {
 public:
  virtual int Compare(void *left, void *right) {
    // Dereference to compare the values, not the pointers.
    auto left_int = *reinterpret_cast<uint32_t *>(left);
    auto right_int = *reinterpret_cast<uint32_t *>(right);
    if (left_int < right_int) {
      return -1;
    } else if (left_int == right_int) {
      return 0;
    } else {
      return 1;
    }
  }
  virtual int GetLength(void *data) {
    // No need to read data - uint32_t is always fixed size.
    return sizeof(uint32_t);
  }
};

You can merge these two functions into one for better efficiency; for clarity I’m keeping them separate.

Besides virtual functions, you can also implement this with a jump table that points to a list of functions, or even a switch/case, etc. They are not fundamentally different - all of them involve a table of addresses to call/jump to, and a memory lookup to determine the target address.
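
For illustration, here is what the switch/case flavor of the same dispatch might look like (FieldType and CompareField are hypothetical names, just to show the shape of it); the compiler typically lowers the switch to a jump table, so an indirect jump is still involved:

#include <cstdint>
#include <cstring>

// Hypothetical switch-based dispatch: the same idea as the virtual call,
// expressed as a switch the compiler can turn into a jump table.
enum class FieldType { UInt32, FixedString8 };

int CompareField(FieldType type, const void *left, const void *right) {
  switch (type) {
    case FieldType::UInt32: {
      auto l = *reinterpret_cast<const uint32_t *>(left);
      auto r = *reinterpret_cast<const uint32_t *>(right);
      return l < r ? -1 : (l == r ? 0 : 1);
    }
    case FieldType::FixedString8:
      return std::memcmp(left, right, 8);
  }
  return 0;  // unreachable for valid field types
}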

However, the additional cost of invoking the right comparison function isn’t zero - as a matter of fact, in the case of a virtual function call it is quite significant compared to the actual comparison itself, since it involves putting arguments into registers/on the stack, pushing the return address onto the stack, setting up the frame pointer, etc.

If you are an experienced systems programmer, you might know tricks to optimize this further. For example, given that this is exactly the same problem an interpreter has, and people like interpreters to be fast, VMs like Dalvik employed advanced techniques such as writing the dispatch code in assembly, using threaded execution (which is a fancy way of saying that the end of the interpreter loop decodes the next instruction instead of jumping back to the beginning), and using computed addresses instead of a jump table. These are interesting topics that I might talk about at some point in the future. Anyway, they are not easy to implement and maintain.

But are there other ways to get around this? Is there a way to compare this without understanding what the data is?

The most straightforward comparison is a byte comparison, or memcmp, and it is the most universal way to compare two byte sequences. Many key/value stores (like LevelDB/RocksDB) only support byte comparison out of the box and allow you to plug in a custom comparator. But before you go ahead and try to implement that custom comparator, let’s give one idea a try: what if we could represent the data in a byte-comparison-friendly format?

The challenge is two-fold:

  1. Encoding the data so that byte order is the correct sorting order
  2. Supporting variable-length data properly, so that you don’t accidentally compare unrelated data

Unsigned 32-bit integer

Let’s start with the most straightforward case: a 32-bit unsigned integer. Assume you are working on an Intel machine just like everybody else (and not something esoteric like SPARC) - those unsigned 32-bit integers are stored as little-endian, which means the least significant byte comes first in memory, before the more significant ones. So 0x12345678 is laid out in memory as:

0x78, 0x56, 0x34, 0x12

Obviously this isn’t what we want - we need to compare the most significant byte first, which is exactly big-endian:

0x12, 0x34, 0x56, 0x78

Now it’s safe to memcmp them - the bytes are in most-significant to least-significant order, and the length is a fixed 4 bytes.
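
Here is a minimal sketch of that encoding step (my own illustration in plain C++, with a made-up helper name):

#include <cstdint>

// Encode a uint32_t into 4 bytes, most significant byte first (big-endian),
// so that memcmp on the encoded bytes matches numeric order.
void EncodeUInt32(uint32_t value, unsigned char out[4]) {
  out[0] = static_cast<unsigned char>(value >> 24);
  out[1] = static_cast<unsigned char>(value >> 16);
  out[2] = static_cast<unsigned char>(value >> 8);
  out[3] = static_cast<unsigned char>(value);
}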

Now those SPARC CPUs look pretty sweet, right?

Signed 32-bit

Let’s make this a bit more interesting. What if the integer is signed?

For signed 32-bit, the range is -2147483648 to +2147483647. There are two cases:

  1. Negative: -2147483648 (0x80000000) to -1 (0xffffffff)
  2. Non-negative: 0 (0x00000000) to +2147483647 (0x7fffffff)

The non-negative case looks alright - it works just like the unsigned 32-bit integer case, as long as the bytes are converted to big-endian.

For the negative case, the order within the range is already correct: -2147483648 (0x80000000), -2147483647 (0x80000001), …, -1 (0xffffffff). The problem is that the most significant bit is always one, which makes negative values compare bigger than the non-negative ones. It is not hard to come up with a fix - just flip the sign bit. Now it becomes:

  1. Negative: -2147483648 (0x00000000) to -1 (0x7fffffff)
  2. Non-negative: 0 (0x80000000) to +2147483647 (0xffffffff)

Now this looks really nice - the two ranges are in the right order, and -1 (0x7fffffff) + 1 = 0 (0x80000000). The universe is balanced.
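
The same idea in code - again just a sketch with a made-up helper name - flips the sign bit and then writes the bytes most significant first:

#include <cstdint>

// Flip the sign bit, then encode big-endian: negative values map to
// 0x00000000..0x7fffffff and non-negative values to 0x80000000..0xffffffff,
// so memcmp order matches signed numeric order.
void EncodeInt32(int32_t value, unsigned char out[4]) {
  uint32_t bits = static_cast<uint32_t>(value) ^ 0x80000000u;
  out[0] = static_cast<unsigned char>(bits >> 24);
  out[1] = static_cast<unsigned char>(bits >> 16);
  out[2] = static_cast<unsigned char>(bits >> 8);
  out[3] = static_cast<unsigned char>(bits);
}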

8-bit ASCII Strings

For strings, let’s again start with the easy case - 8-bit ASCII strings (we’ll refer to them simply as ASCII strings from now on). For a fixed-length ASCII string, it’s really easy - memcmp just works. But what about variable-length ASCII strings?

In that case, the real problem happens when you have strings A and B where A is a prefix of B:

A: A, A, A, A, A, A
B: A, A, A, A, A, A, B, B, B

What if there happens to be other data right after A:

A: A, A, A, A, A, A, 0x43
B: A, A, A, A, A, A, B, B 

In this case, 0x43 = ‘C’, which is bigger than ‘B’, so A compares bigger even though string A is smaller than string B. Oops.

The key to the problem is that you have to compare the two strings by themselves - you can’t drag in unrelated data by accident (which is our challenge #2 from earlier, if you were paying attention). You could pad the strings so that they are the same length, if you know the maximum length ahead of time (for example, VARCHAR in SQL has a max length), but that can waste a lot of space.

If you dig deeper, one interesting insight is that if you could have a magic character that is guaranteed to be smaller than any valid content in the other string at the position where a string ends, then it would just work. In many cases we have no such luxury, as strings may have embedded NULLs. But that does provide a hint: what if we artificially inject such a marker into the strings, so that the longer one gets a bigger marker byte?

A: A, A, A, A, A, A, 0x0
B: A, A, A, A, A, A, 0x1, B, B, B, 0x0

In the above case, A ends with 0x0, while B injects 0x1 as the 7th character, making sure B compares bigger than A at the position where A ends. Note that 0x1 here means there is more data after it, so the encoder/decoder needs to take that into account. This looks nice, but we do need to make sure the markers always land at the same positions. In order to do that, we can pad/break the strings into predictable fixed-length segments with a marker as the last byte. Say we break them apart every 6 characters; it’ll look exactly like this:

A: A, A, A, A, A, A, 0x0
B: A, A, A, A, A, A, 0x1, B, B, B, 0x0, 0x0, 0x0, 0x0

Note the trailing 0x0 padding bytes in B, making sure we break the strings every 6 characters. Now, any experienced programmer will tell you that you should always size things at a power of 2 (so that it plays better with caches, alignment, etc.), so 8/16/32/… would obviously be a better choice. For now let’s go with 8 just to make it easier:

A: A, A, A, A, A, A, 0x0, 0x0
B: A, A, A, A, A, A,   A, 0x1, B, B, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0

A bit wasteful, but much better than padding the entire string to the max length. Also, keep in mind that this encoding supports storing NULL characters inside the string, since only the byte at every 8th position has special meaning.

But we are not done yet. Do you see there is one more problem?

We are padding the strings with 0x0, and now the strings have some unwanted 0x0 padding characters that we cannot distinguish from actual embedded NULL characters. Fortunately we still have plenty of runway in the encoding - we can put 1~8 in the marker to indicate the number of real characters (not counting the padding):

A: A, A, A, A, A, A, 0x0, 0x0
B: A, A, A, A, A, A,   A, 0x2, B, B, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0

But this isn’t quite right yet - it can easily break (thanks to Thief for pointing it out), as the markers themselves take part in the comparison:

A: A, A, A, A, A, A, A, 0x3, A, A,   A, 0x0, 0x0, 0x0, 0x0, 0x0
B: A, A, A, A, A, A, A, 0x2, B, B, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0

To fix this, instead of signaling the number of characters in the next segment, the marker can represent the number of characters in the current segment:

A: A, A, A, A, A, A, A, 0x7, A, A, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2
B: A, A, A, A, A, A, A, 0x7, A, A,   B, 0x0, 0x0, 0x0, 0x0, 0x3

For non-NULL characters it’ll just work, as any other character compares bigger than the 0x0 padding. For embedded NULL characters, either the last non-NULL character will break the tie:

A: A, A, A, A, A, A, A, 0x7, A, A, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2
B: A, A, A, A, A, A, A, 0x7, A, A, 0x0,   A, 0x0, 0x0, 0x0, 0x4

Or, in the pure NULL-padding case, the last marker (0x2 vs 0x4) will disambiguate any difference.

A: A, A, A, A, A, A, A, 0x7, A, A, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2
B: A, A, A, A, A, A, A, 0x7, A, A, 0x0, 0x0, 0x0, 0x0, 0x0, 0x4

This is still not quite perfect, though. If a string happens to end exactly at a segment boundary:

A, A, A, A, A, A, A, 0x7, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0

The final N bytes are wasted just to carry the terminating marker. The fix is simple: instead of N - 1 meaning the segment is full (and forcing us to emit another segment just to show whether the string continues), we can have two cases:

  1. N - 1 : the segment is full and there are no more characters
  2. N : the segment is full and there are more characters

To illustrate this idea:

A, A, A, A, A,   A,   A, 0x8, B, B, B, 0x0, 0x0, 0x0, 0x0, 0x3
A, A, A, A, A,   A,   A, 0x7 
A, A, A, A, A, 0x0, 0x0, 0x5  

In summary, we break the string into segments of 8/16/32/… bytes and use every Nth byte as a special marker (sketched in code right after this list) that indicates:

  1. 1 ~ N - 1 : the number of real characters in the current segment; the rest is 0x0 padding, and the string ends here.
  2. N : the current segment is full and there are more characters to follow.
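
Here is a sketch of an encoder for this scheme (my own illustration with N = 8, i.e. 7 content bytes plus 1 marker byte per segment; the helper name is made up):

#include <cstddef>
#include <string>
#include <vector>

// Encode a string into 8-byte segments: 7 content bytes plus 1 marker byte.
// Marker semantics: 0..7 = number of real characters in this final segment
// (the rest is 0x00 padding, and the string ends here); 8 = the segment is
// full and more characters follow.
std::vector<unsigned char> EncodeString(const std::string &s) {
  const std::size_t kSegment = 8;
  const std::size_t kContent = kSegment - 1;

  std::vector<unsigned char> out;
  std::size_t pos = 0;
  do {
    std::size_t remaining = s.size() - pos;
    std::size_t count = remaining < kContent ? remaining : kContent;

    // Copy the real characters, then pad the segment with 0x00 up to 7 bytes.
    for (std::size_t i = 0; i < kContent; i++) {
      out.push_back(i < count ? static_cast<unsigned char>(s[pos + i]) : 0x00);
    }

    bool more = remaining > kContent;
    out.push_back(more ? static_cast<unsigned char>(kSegment)
                       : static_cast<unsigned char>(count));
    pos += count;
  } while (pos < s.size());

  return out;
}

With this encoding, two different strings are guaranteed to differ somewhere within their common prefix of encoded segments, so a plain byte comparison of the encoded forms reproduces the original string order.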

What about non-ASCII strings?

We can always convert to a case we already know how to handle - for example, convert the string to UTF-8, which byte-compares correctly even in the presence of multi-byte characters. If you haven’t looked at how UTF-8 works yet, you should. It’s brilliant, and everyone should be talking in UTF-8 (I’m looking at you, Windows). Just go to https://en.wikipedia.org/wiki/UTF-8.

What about sort ordering and collation?

Those cases can get really complicated depending on the encoding and collation, so I won’t get into them here. But the idea is always the same: transform the bytes into the correct sorting order as dictated by the character set/encoding/collation. If the encoding’s byte order happens to already be the right sorting order (UTF-8, for example), all the better.

If you are interested you might want to give your favorite encoding a try.

We are not done yet

In this post I’ve discussed approaches for encoding your data in a way that is sort-friendly. In read/seek/lookup-heavy workloads it can make a really big difference in lookup performance, but in write-heavy environments it may not be the right trade-off, as the overhead of the encoding grows and the caching that offsets the decoding cost becomes less effective. At the end of the day, there is no silver bullet, and you need to pick the right solution for the scenario at hand.

However, we are not quite done yet. There is one more interesting case to look at: floats. That is a non-trivial topic, as we need to dive into the binary format of floats/doubles. Hope I’ll see you next time!

EDIT NOTE: This article has been updated to address some bugs in the examples and in the encoding. Thanks everyone for the suggestions/corrections!

The tale of InvalidOperationException

This came up when I was helping someone look into a strange issue. Recently, a process dump capture tool started throwing InvalidOperationException:

Unhandled Exception: System.InvalidOperationException: Process was not started by this object, so requested information cannot be determined.
   at System.Diagnostics.Process.get_StartInfo()
   at SomeTestUtility.WaitForSomeProcessToExit()

Summary of database isolation levels

Summary of different database isolation levels

I’ve always been a bit confused about the different isolation levels in databases, so I spent a bit of time going through them and summarized them in a post. This is both for myself and hopefully also for others looking for similar information.

In this post we’ll talk about:

  • Meaning of different isolation levels - read committed, repeatable reads, snapshot isolation, serializable
  • How they use locks inside transactions
  • How they affect performance

typeof(TSecret) - the secret magic behind .NET generics

How typeof(T) works in .NET generics

In the last post we talked about how .NET does code sharing for reference types. This time let’s take a look at how typeof(T) does its black magic.

In particular, how does the code know what typeof(T) is, in the presence of code sharing?

Obviously, if there were no code sharing at all, each method instantiation would be different, and the code would be instantiated with the correct typeof(T) where T is a real type - it would “just work”.

Before we dive into the implementation, let’s take a first crack at this problem and see what the challenges are.

Does Go need async/await?

Does the go language need async/await support?

Recently I’ve been spending a bit of time playing with the Go language. Coming from a C++ and C# background, I’ve been wondering what C# async/await (or C++ coroutines, or node.js await/promise - pick your favorite) would look like in Go.

It turns out Go does not need async/await - everything can be synchronous by default.

Let me explain why I arrived at this conclusion.

But before that, let’s take a look at why we need async/await in the first place.

Top secret .NET handles - Part 2 - Ref-Counted handles

Last time we talked about the .NET dependent handle - a handle that promotes the secondary object if the primary is promoted, as if there were an imaginary reference between them. This time let’s take a look at another secret handle: the ref-counted handle. A ref-counted handle is a special handle that becomes either strong or weak depending on its ref count. Today it is only used internally in the CLR for COM interop.

You can find its definition in gcinterface.h

    /*
     * REFCOUNTED HANDLES
     *
     * Refcounted handles are handles that behave as strong handles while the
     * refcount on them is greater than 0 and behave as weak handles otherwise.
     *
     * N.B. These are currently NOT general purpose.
     *      The implementation is tied to COM Interop.
     *
     */
    HNDTYPE_REFCOUNTED   = 5,

Top secret .NET handles - Part 1 - Dependent handles

.NET publicly documents 4 kinds of handles:

  1. Weak (also called Short Weak) - Doesn’t keep the target object alive and returns null when the object is gone. The target becomes null when the object enters finalization.

  2. WeakTrackResurrection (also called Long Weak) - Doesn’t keep the target object alive and returns null when the object is gone. It’ll still return the object while the object is being finalized or after it has been resurrected.

  3. Normal (also called strong) - keeps target object alive. If you are not careful, you may leak the object.

  4. Pinned - Keeps the target object alive and prevents the GC from moving it around. Useful when you are passing the object to native code, as native code won’t know if the GC moved it. Note that using a lot of pinning handles may degrade GC performance; the most common offenders are pinned strings/arrays in interop calls.

You can also find them described in the GCHandleType enumeration.

However, besides these 4 types, there are actually more secret internal handle types that are not exposed. In this post I’ll be talking about the dependent handle, and why it is totally awesome.

Calling C functions from Python - part 3 - deep dive into ctypes implementation in CPython

Last time we looked at using ctypes to call C APIs, and at writing an extension module using the Python/C API. Now we can finally tie these two together - looking at how ctypes is actually implemented using a mix of Python/C API and Python code.

  • You can find CPython source code here.
  • ctypes’ C implementation is here
  • ctypes’ python implementation is here.
