A Closer Look at Strings

Compositor is doing a ton of string manipulation.

The core model class that does most of it, SourceModel (mentioned here), is still based on Foundation’s NSMutableString. With the recent improvements to Swift’s native string type, I plan to move the core model code away from NSMutableString and towards native Swift strings. That way it can take advantage of the more compact UTF-8 in-memory representation (compared to NSString’s UTF-16 in-memory format) and benefit from certain performance flags (e.g., isASCII) and the corresponding optimized code paths.
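To make the size difference concrete, here is a minimal sketch that counts the code units of the same text in both encodings. The sample text and variable names are made up for illustration, and the numbers cover only the code units themselves, not any object headers or padding.

```swift
// Compare the storage needed for the same text in UTF-8 and UTF-16.
let sample = "H\u{E9}llo"                      // "Héllo" with a precomposed é (U+00E9)

let utf8ByteCount = sample.utf8.count           // one byte per UTF-8 code unit
let utf16ByteCount = sample.utf16.count * 2     // two bytes per UTF-16 code unit

// ASCII-only content is the case that a flag like isASCII is meant to speed up.
let sampleIsASCII = sample.utf8.allSatisfy { $0 < 0x80 }
let helloIsASCII = "Hello".utf8.allSatisfy { $0 < 0x80 }

print(utf8ByteCount, utf16ByteCount)   // 6 10
print(sampleIsASCII, helloIsASCII)     // false true
```

For mostly-ASCII content such as source text, the UTF-8 form needs roughly half the bytes.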

I quickly realized I didn’t fully understand what happens under the hood when using NSStrings and native Swift strings in code, so I ventured to take a closer look at how the Swift compiler translates code that involves strings. I learned some techniques along the way that may be helpful for other applications, too, so I wrote up my learnings.

Examining what the Swift Compiler emits

Let’s throw some code at the Swift compiler!

Paste the following snippet into Compiler Explorer:

import Foundation

func testCastSwiftStringToNSString() {
    let swiftString = "Hello" // #1
    let nsstring = swiftString as NSString // #2
}

Note: I’m using Xcode 10.2.1 in what follows.

Creating a Swift string from a string literal (#1)

Let’s dissect this. The line

let swiftString = "Hello" // #1

basically translates to

call ($sSS21_builtinStringLiteral17utf8CodeUnitCount7isASCIISSBp_BwBi1_tcfC)@PLT

Some notes on reading Compiler Explorer’s output:

Let’s see what this does at runtime. In Xcode, set a breakpoint on this line

let swiftString = "Hello" // #1

and launch the target in the debugger (I’m using a unit test target to run the code). When execution stops at this breakpoint, we’ll drop into Xcode’s console and continue on the lldb level.

Now, how to set the breakpoint for the $sSS21_builtinStringLiteral17utf8CodeUnitCount7isASCIISSBp_BwBi1_tcfC call (which is presumably an initializer invocation)?

First, we use lldb’s regular expression-capable module name lookup, passing a regex that describes what we are looking for (i.e., _builtinStringLiteral.*utf8CodeUnitCount.*isASCII):

(lldb) image lookup -rn "_builtinStringLiteral.*utf8CodeUnitCount.*isASCII"
5 matches found in /Users/ktraunmueller/Library/Developer/Xcode/DerivedData/Strings-afzvxwjhkbfwrjbkghdpnalizovg/Build/Products/Debug-iphonesimulator/StringTests.xctest/Frameworks/libswiftCore.dylib:
	Address: libswiftCore.dylib[0x0000000000177060] (libswiftCore.dylib.__TEXT.__text + 1523184)
        Summary: libswiftCore.dylib`protocol witness for Swift._ExpressibleByBuiltinStringLiteral.init(_builtinStringLiteral: Builtin.RawPointer, utf8CodeUnitCount: Builtin.Word, isASCII: Builtin.Int1) -> A in conformance Swift.StaticString : Swift._ExpressibleByBuiltinStringLiteral in Swift        
	Address: libswiftCore.dylib[0x00000000001836f0] (libswiftCore.dylib.__TEXT.__text + 1574016)
        Summary: libswiftCore.dylib`protocol witness for Swift._ExpressibleByBuiltinStringLiteral.init(_builtinStringLiteral: Builtin.RawPointer, utf8CodeUnitCount: Builtin.Word, isASCII: Builtin.Int1) -> A in conformance Swift.String : Swift._ExpressibleByBuiltinStringLiteral in Swift        
	Address: libswiftCore.dylib[0x000000000000d190] (libswiftCore.dylib.__TEXT.__text + 40736)
        Summary: libswiftCore.dylib`Swift.String.init(_builtinStringLiteral: Builtin.RawPointer, utf8CodeUnitCount: Builtin.Word, isASCII: Builtin.Int1) -> Swift.String        
	Address: libswiftCore.dylib[0x00000000000077e0] (libswiftCore.dylib.__TEXT.__text + 17776)
        Summary: libswiftCore.dylib`Swift.StaticString.init(_builtinStringLiteral: Builtin.RawPointer, utf8CodeUnitCount: Builtin.Word, isASCII: Builtin.Int1) -> Swift.StaticString        
	Address: libswiftCore.dylib[0x000000000029b000] (libswiftCore.dylib.__TEXT.__text + 2719120)
        Summary: libswiftCore.dylib`dispatch thunk of Swift._ExpressibleByBuiltinStringLiteral.init(_builtinStringLiteral: Builtin.RawPointer, utf8CodeUnitCount: Builtin.Word, isASCII: Builtin.Int1) -> A

Ignoring the protocol witness and dispatch thunk results, this boils down to

libswiftCore.dylib`Swift.String.init(
	_builtinStringLiteral: Builtin.RawPointer, 
	utf8CodeUnitCount: Builtin.Word, 
	isASCII: Builtin.Int1) -> Swift.String

and

libswiftCore.dylib`Swift.StaticString.init(
	_builtinStringLiteral: Builtin.RawPointer, 
	utf8CodeUnitCount: Builtin.Word, 
	isASCII: Builtin.Int1) -> Swift.StaticString

Let’s set breakpoints on both of these, again using an lldb regular expression-capable breakpoint command:

(lldb) rb "Swift.String.init\(_builtinStringLiteral: Builtin.RawPointer, utf8CodeUnitCount: Builtin.Word, isASCII: Builtin.Int1\)"
Breakpoint 2: where = libswiftCore.dylib`Swift.String.init(_builtinStringLiteral: Builtin.RawPointer, utf8CodeUnitCount: Builtin.Word, isASCII: Builtin.Int1) -> Swift.String, address = 0x000000011c0f7190
(lldb) rb "Swift.StaticString.init\(_builtinStringLiteral: Builtin.RawPointer, utf8CodeUnitCount: Builtin.Word, isASCII: Builtin.Int1\)"
Breakpoint 3: where = libswiftCore.dylib`Swift.StaticString.init(_builtinStringLiteral: Builtin.RawPointer, utf8CodeUnitCount: Builtin.Word, isASCII: Builtin.Int1) -> Swift.StaticString, address = 0x000000011c0f17e0

Note that we need to escape the parentheses, because they have a special meaning in regular expressions.
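The effect of the escaping can be reproduced outside the debugger. The following sketch uses Foundation’s regular expression search rather than lldb’s own regex engine, and a shortened symbol name chosen for illustration; the helper regexMatches is hypothetical. Parentheses form a group in both engines, so the principle should carry over.

```swift
import Foundation

// Shortened, illustrative version of the symbol we want to match.
let symbol = "Swift.String.init(_builtinStringLiteral: Builtin.RawPointer)"

// Hypothetical helper: true if `pattern`, read as a regex, matches anywhere in `text`.
func regexMatches(_ pattern: String, in text: String) -> Bool {
    return text.range(of: pattern, options: .regularExpression) != nil
}

// Unescaped parentheses form a group, so the pattern expects "init_builtin..."
// with no literal "(" in between.
let unescapedPattern = "String.init(_builtinStringLiteral: Builtin.RawPointer)"

// Escaped parentheses match the literal "(" and ")" characters.
let escapedPattern = "String.init\\(_builtinStringLiteral: Builtin.RawPointer\\)"

let unescapedMatches = regexMatches(unescapedPattern, in: symbol)
let escapedMatches = regexMatches(escapedPattern, in: symbol)

print(unescapedMatches, escapedMatches)   // false true
```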

Ok, the breakpoints were successfully set up, so let’s continue:

(lldb) continue

We’re hitting this breakpoint:

libswiftCore.dylib`Swift.String.init(_builtinStringLiteral: Builtin.RawPointer, utf8CodeUnitCount: Builtin.Word, isASCII: Builtin.Int1) -> Swift.String:
->  0x11c0f7190 <+0>:   pushq  %rbp
    0x11c0f7191 <+1>:   movq   %rsp, %rbp
    ...

So the string literal translates to the creation of a regular Swift String instance. Who would have thought.

StaticString, the other candidate, is a type designed for a very narrow set of use cases where runtime modification or interpolation must be prevented. It is used in the os.os_log() API, for example.
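Which of the two initializers gets picked depends on the type the literal is checked against. A minimal sketch (the variable names are mine): without a type annotation the literal becomes a String, and only an explicit annotation yields a StaticString, whose public properties mirror the initializer’s utf8CodeUnitCount and isASCII parameters.

```swift
// Without an annotation, a string literal defaults to String.
let plainLiteral = "Hello"

// With an annotation, the same literal produces a StaticString instead.
let staticLiteral: StaticString = "Hello"

print(type(of: plainLiteral))              // String
print(type(of: staticLiteral))             // StaticString
print(staticLiteral.utf8CodeUnitCount)     // 5
print(staticLiteral.isASCII)               // true
```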

Note: Here are some more useful lldb commands:

(lldb) br list
Current breakpoints:
1: file = '/Users/ktraunmueller/Documents/sandbox/misc/Strings/StringTests/StringTests.swift', line = 16, exact_match = 0, locations = 1, resolved = 1, hit count = 1
  1.1: where = StringTests`StringTests.StringTests.testCastSwiftStringToNSString() -> () + 27 at StringTests.swift:16:27, address = 0x000000011a4540eb, resolved, hit count = 1 

2: regex = 'Swift.String.init\(_builtinStringLiteral:', locations = 1, resolved = 1, hit count = 1
  2.1: where = libswiftCore.dylib`Swift.String.init(_builtinStringLiteral: Builtin.RawPointer, utf8CodeUnitCount: Builtin.Word, isASCII: Builtin.Int1) -> Swift.String, address = 0x000000011c0f7190, resolved, hit count = 1 

3: regex = 'Swift.StaticString.init\(_builtinStringLiteral:', locations = 1, resolved = 1, hit count = 0
  3.1: where = libswiftCore.dylib`Swift.StaticString.init(_builtinStringLiteral: Builtin.RawPointer, utf8CodeUnitCount: Builtin.Word, isASCII: Builtin.Int1) -> Swift.StaticString, address = 0x000000011c0f17e0, resolved, hit count = 0 
(lldb) br delete -f
All breakpoints removed. (4 breakpoints)
(lldb) br help
     Commands for operating on breakpoints (see 'help b' for shorthand.)
...

Ok, let’s continue with our example.

Casting a Swift string to a Foundation string (#2)

The line

let nsstring = swiftString as NSString // #2

basically translates to

call ($sSS10FoundationE19_bridgeToObjectiveCAA8NSStringCyF)@PLT

Notes

To find the function signature for setting a breakpoint, we again perform a regular expression name search:

(lldb) image lookup -rn _bridgeToObjectiveC.*NSString
2 matches found in /Users/ktraunmueller/Library/Developer/Xcode/DerivedData/Strings-afzvxwjhkbfwrjbkghdpnalizovg/Build/Products/Debug-iphonesimulator/StringTests.xctest/Frameworks/libswiftFoundation.dylib:
	Address: libswiftFoundation.dylib[0x0000000000015ac0] (libswiftFoundation.dylib.__TEXT.__text + 68256)
        Summary: libswiftFoundation.dylib`(extension in Foundation):Swift.String._bridgeToObjectiveC() -> __C.NSString        
	Address: libswiftFoundation.dylib[0x00000000000e30b0] (libswiftFoundation.dylib.__TEXT.__text + 909456)
        Summary: libswiftFoundation.dylib`(extension in Foundation):Swift.Substring._bridgeToObjectiveC() -> __C.NSString

Since our code involves no Substring-producing API, it should be enough to consider the first candidate.

In this case, we can actually look up the source code for Swift.String. This seems to be what gets invoked:

extension String : _ObjectiveCBridgeable {
  @_semantics("convertToObjectiveC")
  public func _bridgeToObjectiveC() -> NSString {
    // This method should not do anything extra except calling into the
    // implementation inside core.  (These two entry points should be
    // equivalent.)
    return unsafeBitCast(_bridgeToObjectiveCImpl() as AnyObject, to: NSString.self)
  }
  // ... (rest of the extension omitted)
}

Checking with another name lookup for _bridgeToObjectiveCImpl:

(lldb) image lookup -n _bridgeToObjectiveCImpl
4 matches found in /Users/ktraunmueller/Library/Developer/Xcode/DerivedData/Strings-afzvxwjhkbfwrjbkghdpnalizovg/Build/Products/Debug-iphonesimulator/StringTests.xctest/Frameworks/libswiftCore.dylib:
    Address: libswiftCore.dylib[0x0000000000089950] (libswiftCore.dylib.__TEXT.__text + 550624)
        Summary: libswiftCore.dylib`Swift.Dictionary._bridgeToObjectiveCImpl() -> Swift.AnyObject        
	Address: libswiftCore.dylib[0x0000000000097070] (libswiftCore.dylib.__TEXT.__text + 605696)
        Summary: libswiftCore.dylib`Swift.String._bridgeToObjectiveCImpl() -> Swift.AnyObject        
    Address: libswiftCore.dylib[0x000000000001df00] (libswiftCore.dylib.__TEXT.__text + 109712)
        Summary: libswiftCore.dylib`Swift.Array._bridgeToObjectiveCImpl() -> Swift.AnyObject        
    Address: libswiftCore.dylib[0x0000000000168cd0] (libswiftCore.dylib.__TEXT.__text + 1464928)
        Summary: libswiftCore.dylib`Swift.Set._bridgeToObjectiveCImpl() -> Swift.AnyObject

returns Swift.String._bridgeToObjectiveCImpl(), as expected.

This should correspond to the code in StringBridge.swift:

extension String {
  @_effects(releasenone)
  public // SPI(Foundation)
  func _bridgeToObjectiveCImpl() -> AnyObject {
    if _guts.isSmall {
      return _guts.asSmall.withUTF8 { bufPtr in
        return _createCFString(
            bufPtr.baseAddress._unsafelyUnwrappedUnchecked,
            bufPtr.count,
            kCFStringEncodingUTF8
        )
      }
    }
    if _guts._object.isImmortal {
      // TODO: We'd rather emit a valid ObjC object statically than create a
      // shared string class instance.
      let gutsCountAndFlags = _guts._object._countAndFlags
      return __SharedStringStorage(
        immortal: _guts._object.fastUTF8.baseAddress!,
        countAndFlags: _StringObject.CountAndFlags(
          sharedCount: _guts.count, isASCII: gutsCountAndFlags.isASCII))
    }

    _internalInvariant(_guts._object.hasObjCBridgeableObject,
      "Unknown non-bridgeable object case")
    return _guts._object.objCBridgeableObject
  }
}
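The first branch handles small strings, which store their contents inline in the String value itself. The standard library does not expose the isSmall check publicly, so the sketch below only approximates it under an assumption: that on 64-bit platforms with Swift 5 the small form holds up to 15 UTF-8 code units. Both the constant and the helper are hypothetical.

```swift
// Assumption: small-string capacity on 64-bit platforms with Swift 5.
let assumedSmallCapacity = 15

// Hypothetical helper: could this string's UTF-8 contents fit the small form?
// This is an approximation, not the standard library's actual isSmall check.
func couldFitSmallForm(_ s: String) -> Bool {
    return s.utf8.count <= assumedSmallCapacity
}

// Size of the String value itself, in bytes (pointer-width dependent).
print(MemoryLayout<String>.size)

print(couldFitSmallForm("Hello"))                                 // true
print(couldFitSmallForm("A considerably longer string literal"))  // false
```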

Since we’re casting a small string, I’d expect the first if branch to be taken and the _createCFString() function to be called:

@_effects(releasenone)
private func _createCFString(
  _ ptr: UnsafePointer<UInt8>,
  _ count: Int,
  _ encoding: UInt32
) -> AnyObject {
  return _swift_stdlib_CFStringCreateWithBytes(
    nil, //ignored in the shim for perf reasons
    ptr,
    count,
    kCFStringEncodingUTF8,
    0
  ) as AnyObject
}

_createCFString() calls into _swift_stdlib_CFStringCreateWithBytes:

_swift_shims_CFStringRef
swift::_swift_stdlib_CFStringCreateWithBytes(
    const void *unused, const uint8_t *bytes,
    _swift_shims_CFIndex numBytes, _swift_shims_CFStringEncoding encoding,
    _swift_shims_Boolean isExternalRepresentation) {
  assert(unused == NULL);
  return cast(CFStringCreateWithBytes(kCFAllocatorSystemDefault, bytes, numBytes,
                                      cast(encoding),
                                      isExternalRepresentation));
}

which finally calls CoreFoundation’s CFStringCreateWithBytes().

Let’s set a breakpoint for CFStringCreateWithBytes() and see if we hit it. Continuing (twice) confirms our guess:

(lldb) rb CFStringCreateWithBytes
Breakpoint 2: 6 locations.
(lldb) continue
Process 47888 resuming
(lldb) continue
Process 47888 resuming
(lldb) thread backtrace
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
  * frame #0: 0x000000010372dfb0 CoreFoundation`CFStringCreateWithBytes
    frame #1: 0x000000011a9d315e libswiftCore.dylib`function signature specialization <Arg[0] = Exploded> of Swift.String._bridgeToObjectiveCImpl() -> Swift.AnyObject + 110
    frame #2: 0x000000011a801079 libswiftCore.dylib`Swift.String._bridgeToObjectiveCImpl() -> Swift.AnyObject + 9
    frame #3: 0x000000011ae87ac9 libswiftFoundation.dylib`(extension in Foundation):Swift.String._bridgeToObjectiveC() -> __C.NSString + 9
    frame #4: 0x0000000118ae612c StringTests`StringTests.testCastSwiftStringToNSString(self=0x00007f9817c0cea0) at StringTests.swift:17:24

So line #2 in our example allocates a new instance of a CFString, which is bitcast to NSString by String’s _bridgeToObjectiveC() method:

unsafeBitCast(_bridgeToObjectiveCImpl() as AnyObject, to: NSString.self)
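From the caller’s side, all of this hides behind the as cast. Here is a small sketch of the round trip. The wrapper function bridge is made up for illustration, and the fallback branch is only there so the snippet also builds where Objective-C bridging casts are unavailable.

```swift
import Foundation

// Hypothetical wrapper around the cast examined above.
func bridge(_ s: String) -> NSString {
#if canImport(ObjectiveC)
    return s as NSString          // the cast from line #2 of the example
#else
    return NSString(string: s)    // fallback without Objective-C bridging casts
#endif
}

let bridged = bridge("Hello")

print(bridged.length)                  // 5 (UTF-16 code units)
print(bridged.isEqual(to: "Hello"))    // true
```

Note that length counts UTF-16 code units, which only happens to equal the UTF-8 count here because the text is pure ASCII.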

Wrapping Up

Ok, that’s quite a lengthy analysis of two lines of Swift code, but it helped me get a better understanding of what happens under the hood when writing Swift code that deals with strings. I hope you find it useful, too!
