As far as technical challenges go, the Internet of Things is as tough as it gets:
- The scale is large: everything is huge.
- The power is low: there is almost none of it available.
- Wireless is weird: it keeps changing and it wasn’t very nice to begin with.
Today we look at these fundamental problems that every IoT platform must face and how we solve them in our Thingsquare IoT platform.
Be prepared for a somewhat lengthy article.
Scale: When everything is large, all bets are off
Many IoT deployments involve hundreds or thousands of individual devices. With that many devices, even problems that would normally be unlikely to occur become likely to occur.
Large networks are difficult to monitor in the field, but they are even more challenging to work with during development.
At Thingsquare, we use these categories when we talk about development of IoT networks:
- Developer scale: 1-2 devices. When you have 1 or 2 wireless devices in front of you, it is relatively easy to understand what they are doing. You can add printouts or LEDs that blink when things happen, and as a developer, you can feel confident that you are in control. It is even possible to stop the execution of the software on one of the devices and single-step the program.
- Desktop scale: 2-5 devices. At this stage you can no longer control each device on its own, but must treat them as a unit. They are still few enough to monitor, but you will have to use things like blinking LEDs to see what is going on with them.
- Office scale: 5-10 devices. Now you have run out of space to fit the devices on a single desk and must spread them out over an area that begins to become difficult to monitor. And programming them with a new program starts to be a practical challenge, because you will have to physically connect and disconnect each device to the flash programmer.
- Floor scale: 10-100 devices. It now begins to be difficult to find space for all your devices in a single office and you will need to spread out over an entire floor space. This makes it difficult to visually see all devices so the only way to see what is going on is to do it through wireless communication – unless each device is connected with a wired backchannel, which itself is huge work to set up. Also, at this scale, hardware issues start to be seen: manufacturing of hardware can be somewhat flaky, and with 100 devices and a yield of 99%, chances are that one or more of the devices are physically broken.
- Deployment scale: 100-500 devices. This is a scale at which development is rarely done – development usually ends around the 100-device mark. But proof-of-concept rollouts for prototype testing and validation are common. At this scale, Internet connectivity issues start to affect the system. If parts of the system have different connectivity than others (say, because some parts of the system are connected with WiFi and others with 3G), things will behave differently in different parts of the network.
- City scale: 500-1000+ devices. At this scale, automated tools are needed to keep track of the behavior of the system. Also, if all devices are contained in a single network, simple operations start to take significant time. For example, sending a ping message to all devices will take several minutes simply because of the physical speed of the wireless network.
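The 99% yield figure in the floor-scale item can be made concrete with a little arithmetic. The sketch below assumes that each device is good independently of the others, which is our simplifying assumption and not something real manufacturing guarantees. Under that assumption, a batch of 100 devices has roughly a 63% chance of containing at least one broken unit.

```python
def prob_at_least_one_defective(devices: int, yield_rate: float) -> float:
    """Probability that at least one of `devices` units is defective.

    Assumes each unit is good independently with probability `yield_rate`.
    """
    return 1.0 - yield_rate ** devices


if __name__ == "__main__":
    # Compare a desk-sized batch with floor-scale and deployment-scale batches.
    for n in (10, 100, 500):
        print(n, round(prob_at_least_one_defective(n, 0.99), 3))
```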
The strategies that we use at Thingsquare to deal with these challenges are:
- Simulation. We simulate everything, from the physical wireless layer, through the microprocessor layer, to high-level simulation of networks and devices.
- Testbed development. Every feature is developed in a set of testbeds, with the largest being 100 nodes.
- Lightweight crash reports. If the code crashes, the device will provide a brief but useful crash report.
- Regression testing. Every change in the code goes through rigorous automated testing in our simulators.
When dealing with large scale systems, you have very little visibility into what is going on.
And when dealing with IoT devices, which are wireless and don’t have much ability to store and transport logs, you have even less visibility.
Simulation is an essential tool to circumvent this. We use simulation at several layers:
- Wireless network simulation: we simulate the wireless network behavior of our system, thus making it possible to see what happens in the air at any given time.
- Microprocessor emulation: we emulate the processors that run the code, thereby allowing us to measure power and execution time at scale.
- Power simulation: in our network simulator and processor emulator, we track the power consumption of the code and communication so that we won’t need to measure everything on hardware.
Simulation is a powerful tool, but it cannot replace development on real hardware.
Sometimes you need to develop for a physical sensor or actuator. Then you need real hardware to interact with.
But more importantly, a simulator will not behave in the exact same way as the real world. And if you develop your solution entirely in simulation, it will likely break when faced with reality.
At the Thingsquare offices, we have a set of testbeds of increasing size:
- Two testbeds with 10 and 20 devices each.
- One testbed with 100 devices.
We use our testbeds both to develop new mechanisms and to continuously test our system. We can use them to replicate behavior we have seen in customer installations. We can also use them in test patterns to run larger networks than we could physically fit into our offices.
Lightweight crash reports
Software crashes. Particularly during development. When the code crashes, a crash report can help the developer understand where and why the code crashes.
But for low-power IoT devices, there is not much room to store and transmit full crash dumps.
At Thingsquare, we use a lightweight technique to collect crash reports from devices:
- For every build that is uploaded to devices, the ELF binary is stored and tagged with the git commit ID for that build.
- If a device crashes, the program counter at the time of the crash is stored in non-volatile memory.
- When the device reboots after the crash, the commit ID and the program counter at the crash site are reported to the backend.
This makes it possible to build a database of the memory addresses at which crashes occurred and the specific code revision in which each crash occurred. Together with the stored ELF binary for that revision, this allows the developers to investigate and identify what caused the crash, and fix the issue.
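The backend side of this scheme can be sketched in a few lines. This is a minimal illustration under our own assumptions, not the Thingsquare implementation: the `CrashReport` type, the function names, and the `builds/<commit>/firmware.elf` layout are all hypothetical. The only real tool involved is GNU `addr2line`, which maps an address in an ELF binary back to a function and source line. A cross-compiled build would typically use the toolchain-prefixed variant of the tool.

```python
from collections import Counter
from typing import NamedTuple


class CrashReport(NamedTuple):
    commit_id: str        # git commit ID of the build running on the device
    program_counter: int  # program counter saved at the time of the crash


def count_crash_sites(reports):
    """Count how often each (commit ID, program counter) pair was reported."""
    return Counter((r.commit_id, r.program_counter) for r in reports)


def addr2line_command(elf_path: str, program_counter: int) -> list[str]:
    """Build a GNU addr2line command that resolves an address to file:line.

    -e selects the ELF binary, -f also prints the enclosing function name.
    """
    return ["addr2line", "-e", elf_path, "-f", hex(program_counter)]


if __name__ == "__main__":
    # Made-up reports, for illustration only.
    reports = [
        CrashReport("3f2a9c1", 0xA2F4),
        CrashReport("3f2a9c1", 0xA2F4),
        CrashReport("77be010", 0x1C08),
    ]
    for (commit, pc), n in count_crash_sites(reports).most_common():
        # Hypothetical storage layout: one ELF binary per commit ID.
        cmd = addr2line_command(f"builds/{commit}/firmware.elf", pc)
        print(n, commit, " ".join(cmd))
```

Because the device only has to store and send two small values, all of the expensive work (symbol lookup, aggregation) happens on the backend where storage and bandwidth are cheap.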
Regression testing is a standard software development technique to ensure that the software does not break as it is developed.
An IoT platform consists of many types of components, from backend software to the wireless devices. To perform regression testing, every component needs to be tested both on its own and as part of the whole.
At Thingsquare, we use our simulators to perform full-platform regression testing for every change we make to the system. After the regression test is green, we test the system in our testbeds. The regression test suite is designed to catch fatal bugs that could make the testbeds unusable.
Power: There isn’t much of it
The IoT may be powerful, but few things are as powerless as an IoT device.
The power consumption must often be as low as the spontaneous discharge of the battery.
Getting the power consumption down to such ridiculously low levels is both a science and an art. The science is in measuring and understanding the power consumption, either using software or hardware. The art is in knowing how to make good use of this information.
Power consumption is both a hardware and software issue. The hardware needs to be tuned the right way and support turning components off as much as possible. And the software needs to know what to turn off and when – and when it is safe to do so.
In IoT, the trickiest part to get right is usually the wireless communication. The radio draws a significant amount of power, but it is crucial so it can’t be blindly turned off. And the radio draws as much power when it is listening as when it is sending. As network sizes grow, this becomes increasingly critical.
In the Thingsquare platform, we use a range of techniques to deal with the power problem:
- Hardware-based power measurement: we measure the power draw of the hardware with dedicated measurement instruments.
- Software-based power measurement: each node keeps track of how much power it spends and periodically reports it.
- Lifetime estimation: based on the measured power data, we can estimate the lifetime of each device.
- Power tracking with anomaly detection: in large-scale systems, we use anomaly detection to see if any device happens to use more power than expected.
Hardware-based power measurement
The first step is to determine the power consumption of the raw hardware. One of the best ways to do this is with a device called the Otii. We need to do this both to find any hardware bugs that cause increased power consumption and to determine the baseline power consumed by individual hardware components.
Measuring the power consumption of one device will not allow us to see the power consumption of an entire network. For that, we need ongoing measurements.
Software-based power measurement
Software-based power measurement allows us to continuously track the power consumption of each device.
Because the software completely controls the hardware, we only need to measure the time that each component is turned on to get a good estimate of the total power consumption. This data gets reported periodically by each device.
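The on-time bookkeeping described above boils down to multiplying each component's on-time by its current draw. Here is a minimal sketch of that calculation. The component names and all current values are hypothetical placeholders; real numbers would come from hardware measurements of the actual board.

```python
# Hypothetical current draw per component, in milliamps.
# The radio is given the same draw for listening and transmitting.
COMPONENT_CURRENT_MA = {
    "cpu": 4.0,
    "radio_listen": 6.0,
    "radio_transmit": 6.0,
}
SLEEP_CURRENT_MA = 0.002  # assumed baseline, drawn for the whole period


def average_current_ma(on_time_s: dict[str, float], period_s: float) -> float:
    """Estimate the average current over a reporting period.

    `on_time_s` maps a component name to the seconds it was turned on
    during a period that lasted `period_s` seconds.
    """
    if period_s <= 0:
        raise ValueError("period_s must be positive")
    charge_ma_s = sum(
        COMPONENT_CURRENT_MA[name] * seconds for name, seconds in on_time_s.items()
    )
    return SLEEP_CURRENT_MA + charge_ma_s / period_s


if __name__ == "__main__":
    # One hour in which the CPU ran for 20 s and the radio was on for 38 s.
    on_times = {"cpu": 20.0, "radio_listen": 36.0, "radio_transmit": 2.0}
    print(average_current_ma(on_times, 3600.0))
```

On a real device, the on-times would be accumulated by the firmware each time a component is switched on or off, so the device only needs to report a handful of counters.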
Since we now track the power consumption of each device, we can use this to estimate the lifetime of each device.
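A first-order lifetime estimate is simply the battery capacity divided by the average current. The sketch below makes that explicit. It is deliberately naive: it ignores battery self-discharge, temperature, and voltage cutoff, all of which shorten the real lifetime, and the function name and example numbers are ours.

```python
def estimated_lifetime_days(battery_capacity_mah: float, average_current_ma: float) -> float:
    """First-order lifetime estimate: capacity divided by average current."""
    if average_current_ma <= 0:
        raise ValueError("average_current_ma must be positive")
    hours = battery_capacity_mah / average_current_ma
    return hours / 24.0


if __name__ == "__main__":
    # Example: a 2400 mAh battery and a device averaging 0.1 mA.
    print(round(estimated_lifetime_days(2400.0, 0.1)))
```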
Power tracking with anomaly detection
When the number of devices grows, it becomes increasingly difficult to monitor the power consumption of individual devices. We then need to introduce automated tools.
Because we collect power data from every device, we can use anomaly detection to highlight devices that have an unusually high power consumption. These devices need to be looked at closer – there might be a bug that causes this problem. And if we can find it during development, it won’t hit us as we deploy our solution.
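There are many ways to do anomaly detection. As one illustration, the sketch below flags devices whose average power is far above the median of the fleet, using the median absolute deviation (MAD) as a robust measure of spread. This is a technique we picked for the example, not necessarily what runs in the platform, and the threshold of 3.5 is a common rule of thumb rather than a tuned value.

```python
from statistics import median


def find_power_anomalies(average_power_mw: dict[str, float], threshold: float = 3.5) -> list[str]:
    """Return the IDs of devices whose power is unusually high.

    Uses a modified z-score based on the median and the median absolute
    deviation, and only flags devices above the median.
    """
    values = list(average_power_mw.values())
    if len(values) < 3:
        return []  # too few devices to say what is normal
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # Most devices report the same value: anything higher stands out.
        return [dev for dev, v in average_power_mw.items() if v > center]
    return [
        dev
        for dev, v in average_power_mw.items()
        if 0.6745 * (v - center) / mad > threshold
    ]


if __name__ == "__main__":
    # Made-up fleet data: one device draws far more than the rest.
    fleet = {"a": 1.0, "b": 1.1, "c": 0.9, "d": 1.05, "e": 5.0}
    print(find_power_anomalies(fleet))
```

The median and MAD are used instead of the mean and standard deviation because a single badly behaving device would otherwise skew the very statistics used to detect it.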
Once we have identified a problematic device, we can dive into the details and look at the historical power consumption. We have found that averaging the power consumption over several timescales is useful: a 1 hour average is useful for spotting problems that repeat themselves over the course of a day, and a 24 hour average makes it easy to spot problems that occur in a weekly fashion.
The picture above shows a 24 hour average of the power consumption of a device. This device apparently had an increased power consumption over several days in April.
Once we have identified that there is a problem, we can look deeper into why this happened. Without this ability to identify that there was a problem, this problem would have gone undetected and sneaked its way into production.
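The averaging over several timescales mentioned above can be sketched as a simple bucketing of timestamped samples. The data format here, a list of `(timestamp in seconds, power in mW)` pairs, is an assumption made for the example.

```python
def window_averages(samples, window_s: int) -> dict[int, float]:
    """Average power samples in fixed windows.

    `samples` is an iterable of (timestamp_s, power_mw) pairs. The result
    maps each window's start time to the mean power within that window.
    """
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    buckets: dict[int, list[float]] = {}
    for timestamp_s, power_mw in samples:
        start = int(timestamp_s // window_s) * window_s
        buckets.setdefault(start, []).append(power_mw)
    return {start: sum(v) / len(v) for start, v in sorted(buckets.items())}


if __name__ == "__main__":
    # Made-up samples taken every 30 minutes.
    samples = [(0, 1.0), (1800, 3.0), (3600, 5.0), (5400, 5.0)]
    print(window_averages(samples, 3600))    # 1 hour windows
    print(window_averages(samples, 86400))   # 24 hour windows
```

Running the same samples through both window sizes gives the two views described above: the short window exposes patterns within a day, and the long window smooths those out so that multi-day trends become visible.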
Wireless: It is weird
A lot of the IoT is about wireless networking. And wireless communication is weird.
One way to think about wireless communication is to think about it like light: it bounces around and gets obscured in unexpected ways. Wireless coverage may be good at one spot, but bad just one step away. Just like light from a lamp can be obscured, even close to the lamp.
Wireless signals may be stopped if something gets in their way. Many IoT solutions are deployed in locations where things move around. If something big moves into a communication path, that path will be blocked.
Wireless communication is also heavily affected by other wireless communication. And different frequencies have different amounts of interference. The 2.4 GHz frequency band, which includes WiFi and Bluetooth, is a particularly tough space to be in. This is why many use other frequencies, such as sub-GHz communication.
Here is how we address these challenges in the Thingsquare system:
- Mesh networking: we use IPv6 mesh networking to route around obstacles.
- Frequency hopping: we use channel hopping to avoid wireless interference.
Mesh networking is a technique where devices help other devices to reach farther by repeating messages from others.
Instead of requiring that every device be within range of the access point, this lets devices be spread out over a larger area. It also lets the network route around obstacles, automatically.
The Thingsquare platform uses IPv6 networking with the RPL mesh routing protocol. All nodes continuously measure the connection quality to their neighbors, and may rearrange the routing graph if they find better quality links.
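To give a feel for what "rearranging the routing graph" means, here is a heavily simplified sketch of how a node might pick its parent. It assumes each neighbor advertises a path cost toward the root and that the node has measured a link cost to each neighbor (for example, an expected transmission count). The cost units, the hysteresis value, and the function name are our assumptions; this is not the RPL specification and not the platform's code.

```python
def choose_parent(neighbors, current_parent=None, hysteresis=0.5):
    """Pick the neighbor that offers the lowest total cost to the root.

    `neighbors` maps a neighbor name to (advertised_path_cost, link_cost).
    The current parent is kept unless another neighbor is better by at
    least `hysteresis`, which avoids flapping between similar routes.
    """
    if not neighbors:
        return None

    def total_cost(name):
        path_cost, link_cost = neighbors[name]
        return path_cost + link_cost

    best = min(neighbors, key=total_cost)
    if current_parent in neighbors:
        if total_cost(current_parent) - total_cost(best) < hysteresis:
            return current_parent
    return best


if __name__ == "__main__":
    # "a" is closer to the root, but the link to it is poor.
    neighbors = {"a": (1.0, 3.0), "b": (2.0, 1.2)}
    print(choose_parent(neighbors, current_parent="a"))
```

The hysteresis is the important design detail: link quality measurements are noisy, and without it a node would keep switching between two nearly equal parents.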
The mesh formation and maintenance process is entirely automatic. So the network can be extended simply by dropping extenders into the network.
Frequency hopping is a way to avoid spending too much time on a specific wireless channel. This is needed because that channel may be used by other communication.
For some frequency ranges, frequency hopping is a regulatory requirement. Devices that do not properly switch channels must not be deployed.
The Thingsquare platform uses frequency hopping both to comply with regulations and to enable multiple networks at the same location. Each network has its own hop schedule, which makes the networks interfere as little as possible with each other. Separate networks also have different security keys, but keeping the frequencies separate makes the system more efficient.
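One way to picture a per-network hop schedule is as a channel sequence derived from something unique to the network. The sketch below seeds a pseudo-random shuffle with a network ID, so every device in the same network computes the same sequence without exchanging it. This is purely illustrative: the use of a network ID as the seed, the channel numbering, and the function names are our assumptions, and a real device would use a sequence generator suited to constrained hardware plus time synchronization between nodes.

```python
import random


def hop_schedule(network_id: int, channels: list[int], length: int) -> list[int]:
    """Build a repeatable hop schedule for one network.

    The schedule is built from shuffled rounds of all channels, so every
    channel is used equally often. The same network_id always yields the
    same schedule.
    """
    if not channels:
        raise ValueError("channels must not be empty")
    rng = random.Random(network_id)  # deterministic per network
    schedule: list[int] = []
    while len(schedule) < length:
        one_round = list(channels)
        rng.shuffle(one_round)
        schedule.extend(one_round)
    return schedule[:length]


def channel_at(schedule: list[int], slot: int) -> int:
    """Channel to use in a given time slot; the schedule repeats."""
    return schedule[slot % len(schedule)]


if __name__ == "__main__":
    channels = list(range(8))  # hypothetical channel numbers
    for network_id in (1, 2):
        print(network_id, hop_schedule(network_id, channels, 16))
```

Using whole shuffled rounds rather than independent random picks keeps the time spent on each channel even, which is the property the text above asks for.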
The Internet of Things is a significant technical challenge because of the large scale of deployments, the power requirements, and the wireless communication.
Fortunately, by using an IoT platform, you do not need to face these challenges directly. They should have already been solved by the platform.
The Thingsquare IoT platform supports networks with hundreds to thousands of devices in each network, offers extremely low power consumption, and employs mesh networking and frequency hopping to address these fundamental IoT challenges.
Thingsquare helps companies take action and see results with the Internet of Things. Get in touch with us today to discuss how we can help your IoT solution come true!