Process Data Across Partitions – Design and Implement a Data Stream Processing Solution

Streaming scenarios often do not meet the requirements for the parallelization that partitioning provides. The most obvious example is the brainjammer brain wave solution implemented in this chapter: partitioning cannot be used because the output type for the data stream is Power BI, which does not support partitioning. Other conditions that prevent fully parallel execution are mismatched partition counts and multistep queries. As mentioned briefly in the previous section, to achieve maximum parallel execution into, within, and out of an Azure Stream Analytics job, the input type and the output type must have the same number of partitions. Multistep queries reduce parallelization efficiency when their steps use different PARTITION BY values. A multistep query is similar to the one used in Exercise 7.5: it contains more than one SELECT statement, and each statement can contain a PARTITION BY clause that partitions on different values. For example, the first step could partition by both the PartitionId and a ReadingId, while the second partitions by only the ReadingId, as shown in the following pseudo query:

WITH BrainwaveResults AS (
  SELECT System.Timestamp() AS IngestionTime,
    PERCENTILE_CONT(0.5) OVER (ORDER BY brainwaves.ALPHA) AS medianALPHA,
    brainwaves.ReadingId, PartitionId
  FROM brainwaves PARTITION BY PartitionId, ReadingId
  GROUP BY IngestionTime, TumblingWindow(second, 5)),
ScenarioDetection AS (
  SELECT medianALPHA FROM BrainwaveResults PARTITION BY ReadingId)
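The two conditions described above (matching input and output partition counts, and the same PARTITION BY values in every step) can be expressed as a simple check. The following Python sketch is only an illustration of that reasoning; the QueryStep class and the is_fully_parallel function are hypothetical names invented for this example and are not part of Azure Stream Analytics or any Azure SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryStep:
    """One SELECT step of a multistep query (hypothetical model)."""
    name: str
    partition_by: tuple  # column names used in the step's PARTITION BY clause


def is_fully_parallel(input_partitions, output_partitions, steps):
    """Return True only when both conditions from the text hold."""
    # Condition 1: input and output partition counts must match.
    if input_partitions != output_partitions:
        return False
    # Condition 2: every step is partitioned, and all steps use the same key.
    keys = {step.partition_by for step in steps}
    return len(keys) == 1 and () not in keys


# The pseudo query above: the two steps partition on different values.
mismatched = [
    QueryStep("BrainwaveResults", ("PartitionId", "ReadingId")),
    QueryStep("ScenarioDetection", ("ReadingId",)),
]
# An assumed alternative in which both steps partition on the same value.
aligned = [
    QueryStep("BrainwaveResults", ("PartitionId",)),
    QueryStep("ScenarioDetection", ("PartitionId",)),
]

print(is_fully_parallel(32, 32, mismatched))  # False: keys differ between steps
print(is_fully_parallel(32, 32, aligned))     # True: counts and keys both match
print(is_fully_parallel(32, 1, aligned))      # False: partition counts differ
```

A result of False here means only that the job is not parallel from end to end; some parts of it may still run in parallel.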

Remember that the PARTITION BY clause is necessary only when you are using Azure Stream Analytics with a compatibility level lower than 1.2. The Compatibility Level blade for the Azure Stream Analytics job shows the current setting, which you can change if required. Figure 7.32 illustrates the blade in the Azure portal.

FIGURE 7.32 The Azure Stream Analytics Compatibility Level blade

There is one more point to call out regarding parallelization when your solution doesn’t meet the requirements for end‐to‐end partitioning. Maximum parallelization requires that all the requirements be met, but failing to meet all of them doesn’t mean you can’t gain some parallelization. Some parallelization is better than none. Even though a nonpartitioned output type like Power BI means that a portion of the data stream flows more slowly, you can still make gains through the input type and within Azure Stream Analytics itself. For example, when no partition key is included with the data stream sent to the Event Hubs endpoint, the data is load balanced across all partitions. In a scenario where there are only two partitions and the volume and velocity are relatively low, this can still work fine. However, when you have, for example, between 100,000 and 1,000,000 events per minute spread across the maximum of 32 partitions, it is feasible that some kind of shuffling must happen in order to put the data into a queryable form. By passing a partition key to Event Hubs, all events with the same partition key flow into the same Event Hubs partition, which means that no shuffling is required and a single Azure Stream Analytics partition (i.e., a node) performs the coded analytics on that subset of windowed data.
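To make the effect of a partition key concrete, the following Python sketch simulates both cases locally. It is a rough model under stated assumptions and not how Event Hubs is implemented: load balancing without a key is modeled as a simple round-robin, the key-to-partition mapping uses zlib.crc32 as a stand-in for the service's own hashing, and the reading identifiers are made up for the example.

```python
import zlib
from collections import defaultdict
from itertools import cycle

PARTITION_COUNT = 32  # the maximum mentioned in the text


def keyed_partition(partition_key, partition_count=PARTITION_COUNT):
    """Map a key to a partition with a stable hash (stand-in for the real one)."""
    return zlib.crc32(partition_key.encode("utf-8")) % partition_count


def spread(events, assign):
    """Return {reading_id: set of partitions that received its events}."""
    seen = defaultdict(set)
    for reading_id in events:
        seen[reading_id].add(assign(reading_id))
    return seen


# 100 events for each of 5 hypothetical reading ids, interleaved.
events = [f"reading-{i % 5}" for i in range(500)]

# No partition key: events are distributed round-robin across all partitions.
round_robin = cycle(range(PARTITION_COUNT))
without_key = spread(events, lambda _reading_id: next(round_robin))

# With a partition key: the same key always lands on the same partition.
with_key = spread(events, keyed_partition)

for reading_id in sorted(with_key):
    print(reading_id,
          "| partitions touched without key:", len(without_key[reading_id]),
          "| with key:", len(with_key[reading_id]))
```

In this simulation, the events for each reading end up scattered across every partition when no key is supplied, which is the situation that would require shuffling before a per-reading windowed query can run. With a key, each reading stays on a single partition.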

DP-203 Azure Data Engineer

Katie Cox
