performance-tuning high-capacity cryptostorm sessions

cryptostorm_ops

performance-tuning high-capacity cryptostorm sessions

Post by cryptostorm_ops » Tue Feb 04, 2014 7:47 am

Our ops team has, in recent weeks, been doing some intensive server-side work to support the broadest possible range of high-bandwidth use-case scenarios. This is an ongoing process: given the degree of topological modification we've made to the foundations of packet transit in our security model, network dynamics are essentially an emergent phenomenon. That means, basically, that there are occasional "backward steps" in perceived/tested throughput as we try out various parameters in exitnode clusters, review the test data, and tune accordingly.

I've been asked to summarise some of this work in a public thread, and I will do my best to do so.

To start, I am bringing over some text from email messages I've been exchanging with network members who are able to push high-bandwidth test sessions through exitnodes and report on results. For reasons that are pretty obvious when considered fully, having a broad swath of test-case examples is very helpful in real-world perf-tuning of a network of this sort. No matter how much in-house work we do, we're limited by our local network parameters - and of course "testing" the network from inside the network is basically pointless.

We're in the midst of fine-tuning some of the clusters to support maximum session throughput. It's an ongoing process, and right now the Montreal cluster is... well, in the middle of a dip in performance as we get things settled the way we like them.

My advice is that you - for now, at least - direct a session specifically at our Icelandic cluster. It's been perf-tuned extensively, and should be able to support high throughput consistently. (All of our clusters have ample capacity; perf-tuning at this level isn't about raw capacity, but rather about a series of cascading kernel settings related to packet buffers and socket allocation.)

I believe there's a forum thread on selecting clusters, but the short version is that you can change the config file to specify this remote parameter:

cluster-iceland.cryptostorm.net
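For example - and this is just a sketch assuming a standard OpenVPN-style client config; keep whatever port and protocol lines your existing cryptostorm config already uses - the relevant lines would look something like:

Code: Select all

# hypothetical OpenVPN client config snippet - only the hostname is the point here;
# the port/proto values are placeholders, keep the ones from your existing config
remote cluster-iceland.cryptostorm.net 443
proto udp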

If there are any problems with this, let me know right away. We're always keen to have high-throughput members who can assist with performance tuning, as it's a big focus of our network.

Finally, please note that some "speed test" applications will give erratic results when routed through our network - they assume certain topological structures in how they calculate and throttle throughput during testing, and those assumptions can run up against the layers of SNATted de-coupling and attendant re-packetization that happen during transit through cryptostorm. So, while naive throughput testing is useful, it's really helpful to test with several tools and also, if possible, do a couple of brute-force wget file grabs to see how they perform. Last: because of the way we dynamically allocate socket assignments for network sessions at the kernel level (using a stochastic round-robin algorithm rather than the FIFO-based Linux defaults), it can take a few seconds for big single-socket (i.e. TCP) sessions to spool up to capacity.

Generally, we see an initial jump to a megabit or so, then a lull for a second or two, then a linear spool up to full capacity (this is why some speed test apps show odd results - they see that first plateau, and throttle down the data flowing into the test pipe, resulting in a self-fulfilling error). That's why a raw wget can be useful: it's a "dumb" test and just dumps packets into a TCP session as fast as it possibly can. Those will, then, fill all the cascading packet buffers as the session punches across cryptostorm's network interfaces, and as that happens reported throughput sees big jumps... which makes sense, if you imagine a series of buckets that each must fill before dumping into the next one: it looks like a slow process at first, but once all the buckets are brimming, water passes through the series at the full rate of the inflow.
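If you want to replicate the "dumb" test yourself, a raw wget against any large file works; the URL below is purely a placeholder - point it at any big file on a host with known-good bandwidth:

Code: Select all

# throwaway throughput check - URL is a placeholder, use any large file on a well-connected server
# writing to /dev/null discards the data, so only the transfer speed matters
wget -O /dev/null http://example.com/some-large-file.bin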

Thanks for the ongoing feedback,

~ cryptostorm_ops

cryptostorm_ops

Re: performance-tuning high-capacity cryptostorm sessions

Post by cryptostorm_ops » Tue Feb 04, 2014 8:19 am

Another thing I frequently see is technically unsophisticated VPN review articles that report "test results" for networks. Actually, I don't think I've ever seen a VPN review test that is at all useful in tracking actual network performance.

What I have seen is companies that pick settings for their servers that throw "good" test results on speedtest.net, but suck for actual network use. That's actually very common, and it's really unfortunate, because then people try to use these "fast" networks and find their actual performance is crap.

Without getting into too much boring detail, the kinds of data that flow across cryptostorm's network backbone are enormously variable. There's everything from low-bandwidth TCP sessions to massive-volume, state-free, UDP-based "sessions" involving thousands of simultaneous peer connections in a filesharing application. Plus, all of these come into cryptostorm from a wide range of local network configurations: some are already NATted through a residential router that is barely able to handle packet transit, whereas others are coming out of really well-administered academic or corporate network environments that might as well be their own bloody standalone ASes! And some members see their sessions actively packet-shaped/throttled by ISPs who apparently feel it's ok to pinch down encrypted network traffic whenever they feel like it.

The net result is that a single "speed test" application cannot really give an accurate picture of network performance. In fact, most "speed test" apps are fairly simple: TCP-based tests that self-throttle as they see packets start to back up into source-device queues (otherwise they'd crash their own outbound machines if they kept shoveling packets into the queue after their routers and/or NICs noticed the queues filling up for a given session). In a generic sense, that's fine - but when sessions come through cryptostorm, they are natively stripped down to the packet level, NATted through the kernel, and pushed off the physical NICs of our cluster servers as one big clump of packets (and the reverse, for inbound-from-member data - think of the topological model of the TUN interface mediating between the cryptostorm network daemon and the physical NIC's 'window to the world' layering). There's a series of packet (and socket-assignment) queues involved in this process - and a speed test app that doesn't know about them can throw very unreliable/inaccurate results through no fault of its own. It's just not designed to measure this kind of network topology.
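For the curious, you can watch a bit of that layering yourself from a shell on a connected Linux client. The interface names below are the common defaults, not guaranteed on your machine - check `ip link` for yours:

Code: Select all

# per-interface packet/error/drop counters for the VPN's TUN interface and the physical NIC
# (tun0/eth0 are assumptions - substitute whatever `ip link` shows on your box)
ip -s link show tun0
ip -s link show eth0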

Our in-house metrics for cluster performance are driven almost entirely by close attention to socket allocation overhead at the kernel level, as well as packet queues at the NIC. If sockets are assigning smoothly, and the NICs are able to onload/offload packets without their buffers overflowing into kernel space (and the attendant ring buffers involved), then our perf-tuning is successful. This matters because there's a huge amount of stochasticity in the actual traffic coming through an individual node/server: maybe there's 100 members connected, but not many using a lot of bandwidth... or perhaps there's only 20 connected, but several are sitting on 100 megabit local pipes and are pushing big files through the network. Or: some of our nodes are heavily provisioned, others much less so and are instead clustered together to load-balance amongst themselves. So one machine might carry 100 high-traffic member sessions just fine, whereas another would choke long before that.
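Roughly speaking - and this is a sketch, since exact stat names vary by distro and NIC driver - the counters we watch boil down to things like:

Code: Select all

# socket allocation summary: TCP/UDP sockets in use, orphans, timewait buckets
ss -s
# per-NIC driver statistics; non-zero drop counters mean ring buffers are overflowing
ethtool -S eth0 | grep -i drop
# /proc/net/softnet_stat, 2nd column: packets dropped because the per-CPU backlog queue filled
cat /proc/net/softnet_stat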

In the end, the best "test data" come from members who tell us whether the network is performing on par with "bareback" non-cryptostorm sessions. That's the real metric. You hear a lot of blabber about how "encryption slows down VPNs," and it's essentially all bullshit. I've yet to see kernel metrics on a VPN network that showed raw CPU bottlenecking as a result of the simple application of symmetric crypto. Long before that point, other areas of kernel performance are the cause of transit problems. Those areas take much more experience and knowledge to diagnose and fine-tune, which is why you see amateurs blame "crypto" when their servers are slow. That's nonsense. The OpenSSL libraries themselves, for all their flaws, are fast post-compile and work nicely with all the major chipset architectures at the binary level. They aren't where things slow down.
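If anyone wants to sanity-check that claim on their own hardware, OpenSSL ships a benchmark mode; the cipher below is illustrative - test whichever one your sessions actually negotiate. The raw single-core numbers it reports give a sense of where the crypto ceiling actually sits:

Code: Select all

# raw symmetric-crypto throughput on this box, via the EVP interface
openssl speed -evp aes-256-cbc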

Perhaps the worst thing we see is networks that, because their admins don't have any idea how to properly run machines, simply allow any one network session to effectively monopolise an entire server (or, more often, VPS instance) during a "speed test." This is how you see really bad VPN networks show "good" speed test results. That one TCP session for a speedtest.net run has grabbed all the kernel's resources for packet transit, and is all but locking down the NIC as it pushes packets through. Sure, it shows 10 megabits/second download or whatever... but every other person logged into that machine just saw their sessions slow to a crawl or start dropping TCP packets entirely. Of course, the "test" doesn't report that - those other people have no idea they've been crowded out by that speedtest.net session. And the review goes up on some clickbait blog somewhere: fast!

But if you're actually running a network to benefit all the network members, and not just to trick uneducated clickbait "reviewers" into saying your network is fast, then doing this is a terrible plan. You want everyone on the network to have consistently good network performance, whether they're pushing a big file across via TCP or whether they're gathering bits and pieces of obscure .torrents from a few hundred global swarm peers on ephemeral UDP connects. And you want the people streaming video to get reliable stream performance, plus those using videochat or other realtime apps to have non-glitchy sessions. That's a big-picture challenge that is NOT reflected by clicking an icon on speedtest.net and posting a screenshot of the result.

We know we have a performance problem when we get messages via our support folks that "the network seems slow" from a chunk of actual network members. We go into high gear when that happens, as it's always "real" - as compared to the nonexistent "performance issues" that come when one person somewhere clicks on a speed test and worries that the numbers don't look high enough. Sometimes that's a sign of a problem - but usually not. We're monitoring our machines closely enough, 24/7, that a genuine problem like that will already have thrown red flags in my monitoring long before the report arrives. It's still good to get those reports, of course, but usually they're transient: an ISP bottleneck, a router that isn't handling port 443 UDP packets well, that sort of thing. But if a dozen members say that the Frankfurt cluster is slow, then I guarantee you there's a problem there - it's just a question of hunting it down.

Perf-tuning cryptostorm is a really fascinating technical challenge: unlike most areas of network administration, it's mostly new questions we're asking, and we can't just go to Stack Exchange and see what other smart people are already doing. So we do a lot of experimental parameter tweaking at the kernel level, within exitnode clusters, in realtime. This kind of work is evolving into a full-time job, to be honest, as it's clear there are still big gains to be made in overall, real-use-scenario performance at the network level. Probably a few dissertation topics lurking in there, too, as time goes by.
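Mechanically, that kind of realtime experimentation is nothing exotic: individual parameters get flipped live with sysctl and then watched against the usual counters. The parameter below is just an arbitrary example:

Code: Select all

# change a single kernel parameter live (takes effect immediately; lost on reboot unless persisted)
sysctl -w net.core.netdev_max_backlog=262144
# read it back to confirm
sysctl net.core.netdev_max_backlog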

Anyway, when people want to see whether cryptostorm meets their performance needs, we ask our support folks to provide them with testing tokens and let actual network use be the standard. At that level, it is very, very rare that someone tests the network and feels that it's slow in actual use (not just in a speedtest.net result) - and if we do hear that, we listen closely, as it's a chance to learn something important.

I am happy to share as much as people are interested in reading when it comes to the specifics of how we perf-tune the clusters. I don't worry that some competitor will "steal" what we report, because doing this is a lot of work and there's no bash script that will just magically make it happen if someone clicks on it. It requires careful attention to many layers of systems architecture - and anyone able to do that effectively is welcome to borrow what we've learned at cryptostorm in their own work. Hopefully they'll share back their own results and experience - but even if they don't, we're not keeping this stuff secret.

But sometimes I can see people's eyes glaze over when I talk about this stuff - to me it is fascinating, but it's not for everyone.

Finally, for people connecting to cryptostorm from Linux machines, I'm happy to provide some advice on tuning kernel parameters locally to ensure good throughput. The current kernel builds are pretty good about most things, but some of the packet-buffering defaults are... mystifying to me, really. Those folks are super smart, so I'm sure they have their reasons, but from my perspective I'd never stay with default kernel settings on one of my own local machines, in terms of network optimization. I assume much of this also applies to Macs, since they're just BSD Unix hiding under a layer of high-margin walled-garden obfuscation... but I don't know firsthand, as it's not my world.
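As a rough starting point - a sketch, not a prescription, since sensible values depend on your link speed and RAM - the client-side knobs most worth a look are the socket buffer ceilings. Drop something like this into /etc/sysctl.conf (or a file under /etc/sysctl.d/) and load it with sysctl -p:

Code: Select all

# client-side starting point only - values are illustrative, tune for your own link and memory
# raise the ceilings on socket receive/send buffers so TCP autotuning has room to work
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216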

~ c_o

polarissucks01

Re: performance-tuning high-capacity cryptostorm sessions

Post by polarissucks01 » Sat Feb 08, 2014 2:06 am

I don't know if my message was deleted because it's not supposed to be here or it just didn't post right. I will try again. For the last week I have been using the Montreal server. I download quite a lot. I have a 50 megabit connection. Using cryptostorm I always lose 7 of those megabits. Using other VPN services I normally lose 1-2. This problem is not only with torrenting. When I am on my Mac and am downloading something like software updates from Apple I normally see quite a large hit in my download speed also. PJ told me to post what was going on here. I have run a WinMTR test to the Montreal server and see no packet loss.

cryptostorm_ops

version 1.6 kernel session parameters (sysctl)

Post by cryptostorm_ops » Sat Feb 08, 2014 12:50 pm

This is the latest version (1.6) of the production sysctl parameters being tested in the Montreal cluster, as of this morning. Note that it is deployed only on one test node, to enable A/B performance monitoring. We've been looking closely at all the available data to see how we can best optimize the network for consistently strong throughput. This is something we will continue, as there is always room for improvement.

It is too early to say whether these further refinements in kernel network settings are proving effective, or not, as it takes some time for client sessions to pick up the new parameters (indirectly) and begin throwing data at the node in question at a different clip.

Code: Select all

# cryptostorm.is modded perf-tuned sysctl rev. 1.6
# CentOS 6.whatever - tweaked by p_j
# For binary values, 0 is disabled, 1 is enabled.

net.ipv4.ip_local_port_range = 32768 61000

# Decrease the default TCP keepalive time, probe count, and interval
net.ipv4.tcp_keepalive_time = 512
net.ipv4.tcp_keepalive_probes = 13
net.ipv4.tcp_keepalive_intvl = 32

# TCP window scaling for high-throughput, high-pingtime TCP performance
net.ipv4.tcp_window_scaling = 1

# Enables packet forwarding
net.ipv4.ip_forward = 1
net.ipv4.conf.all.forwarding = 1
net.ipv4.conf.default.forwarding = 1
net.ipv6.conf.all.forwarding = 1
net.ipv6.conf.default.forwarding = 1

# Disable reverse-path filtering (source address verification) - anti-spoofing protection off
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.default.rp_filter = 0

# disable SACK/D-SACK/FACK and SYN cookies to favour raw throughput, for now
net.ipv4.tcp_sack = 0
net.ipv4.tcp_dsack = 0
net.ipv4.tcp_fack = 0
net.ipv4.tcp_syncookies = 0
net.ipv4.tcp_syn_retries = 5
net.ipv4.tcp_synack_retries = 4
net.ipv4.tcp_max_syn_backlog = 65535

# Enable sending ICMP redirects and accepting source-routed packets
net.ipv4.conf.all.send_redirects = 1
net.ipv4.conf.default.send_redirects = 1
net.ipv4.conf.all.accept_source_route = 1
net.ipv4.conf.default.accept_source_route = 1
net.ipv6.conf.all.accept_source_route = 1
net.ipv6.conf.default.accept_source_route = 1

# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 0

# Controls whether core dumps will append the PID to the core filename
# Useful for debugging multi-threaded applications
kernel.core_uses_pid = 1

# Don't ignore directed pings
net.ipv4.icmp_echo_ignore_all = 0

# Controls the default maximum size of a message queue, in bytes
kernel.msgmnb = 65536

# Controls the maximum size of a single message, in bytes
kernel.msgmax = 65536

# specifies the minimum virtual address that a process is allowed to mmap
vm.mmap_min_addr = 4096

# How many times to retransmit before giving up on an established TCP connection
net.ipv4.tcp_retries2 = 15
net.ipv4.tcp_retries1 = 3

# Increase the maximum memory used to reassemble IP fragments
net.ipv4.ipfrag_high_thresh = 512000
net.ipv4.ipfrag_low_thresh = 446464

# maximum number of open file handles system-wide (upper bound on concurrent sockets/sessions)
fs.file-max = 360000

# Set maximum shared memory segment size to 256MB (note: shmmax is in bytes, shmall is in pages)
kernel.shmmax = 268435456
kernel.shmall = 268435456

# per https://gist.github.com/kfox/1942782
net.ipv4.neigh.default.gc_thresh1 = 4096
net.ipv4.neigh.default.gc_thresh2 = 8192
net.ipv4.neigh.default.gc_thresh3 = 16384
net.ipv4.neigh.default.gc_interval = 5
net.ipv4.neigh.default.base_reachable_time = 120
net.ipv4.neigh.default.gc_stale_time = 120
net.core.netdev_max_backlog = 262144
# net.core.rmem_default = 16777216
# net.core.optmem_max = 2048000
net.core.rmem_max = 108544
net.core.somaxconn = 262144
net.core.wmem_max = 108544
net.netfilter.nf_conntrack_max = 10000000
net.netfilter.nf_conntrack_tcp_timeout_established = 40
net.netfilter.nf_conntrack_tcp_timeout_close = 10
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 10
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 10
net.netfilter.nf_conntrack_tcp_timeout_last_ack = 10
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 10
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 10
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 10
net.ipv4.tcp_fin_timeout = 32
net.ipv4.tcp_max_orphans = 262144
net.ipv4.tcp_timestamps = 0

# TCP buffer tuning (per a 'tuning TCP for the web' PDF)
net.ipv4.tcp_rmem = 4096 65536 16777216
net.ipv4.tcp_wmem = 4096 16384 16777216
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_max_tw_buckets = 13107200

# Increase TCP queue length
net.ipv4.neigh.default.proxy_qlen = 96
net.ipv4.neigh.default.unres_qlen = 6

# Do a 'modprobe tcp_cubic' first
net.ipv4.tcp_congestion_control = cubic

# cache ssthresh and other metrics from previous connections
net.ipv4.tcp_no_metrics_save = 0
# disable automatic receive-buffer moderation (buffers are sized explicitly via tcp_rmem above)
net.ipv4.tcp_moderate_rcvbuf = 0

# Enable a fix for RFC1337 - time-wait assassination hazards in TCP
net.ipv4.tcp_rfc1337 = 1

# UDP parameters
net.ipv4.udp_mem = 65536 173800 419430
net.ipv4.udp_rmem_min = 65536
net.ipv4.udp_wmem_min = 65536
 
# Ignore broadcast ICMP echo requests (smurf protection)
net.ipv4.icmp_echo_ignore_broadcasts = 1

# Ignore bogus ICMP error responses
net.ipv4.icmp_ignore_bogus_error_responses = 1

# Enable ICMP Redirect Acceptance
net.ipv4.conf.all.accept_redirects = 1
net.ipv4.conf.default.accept_redirects = 1
net.ipv6.conf.all.accept_redirects = 1
net.ipv6.conf.default.accept_redirects = 1

# Keep at least 64MB of memory free for atomic/kernel allocations
vm.min_free_kbytes = 65536

# Don't log spoofed, source-routed, or redirect packets (martians)
net.ipv4.conf.all.log_martians = 0
net.ipv4.conf.default.log_martians = 0

# disable ipv6
# net.ipv6.conf.all.disable_ipv6 = 1
# net.ipv6.conf.default.disable_ipv6 = 1
# net.ipv6.conf.lo.disable_ipv6 = 1

# Flush the route cache so that subsequent connections immediately pick up the new values
net.ipv4.route.flush = 1
net.ipv6.route.flush = 1
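(For anyone who wants to replay this on a test box of their own: save the above as a file and load it with sysctl. The path below is arbitrary - any file readable by root works.)

Code: Select all

# load the parameter file live; -p reads the named file instead of the default /etc/sysctl.conf
sysctl -p /etc/sysctl.d/99-cryptostorm-perf.conf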

Pattern_Juggled

Re: performance-tuning high-capacity cryptostorm sessions

Post by Pattern_Juggled » Fri Feb 21, 2014 7:25 pm

I'll just go ahead and leave this here...
[attached image: righteous.png]
...never, ever seen anything like it in more than six years providing network security service. Not even close. :wtf:

parityboy

Re: performance-tuning high-capacity cryptostorm sessions

Post by parityboy » Fri Feb 21, 2014 10:03 pm

@PJ

The six day connection uptime? Yeah, I can attest to that. :D Not to give you a swelled head or anything, but the work you and the team have done in building CS has been nothing short of stunning. I'll be sticking with you and the team for a long time to come... :D

Pattern_Juggled

Re: performance-tuning high-capacity cryptostorm sessions

Post by Pattern_Juggled » Wed Oct 29, 2014 5:33 am

Bump, as this topic has been coming up a lot of late - mostly for Windows sessions, as I think our Linux session performance is currently doing quite well. Not sure about OS X - any feedback from the member community?

Cheers,

~ pj
