Temporal Windows – Design and Implement a Data Stream Processing Solution

Chapter 3, “Data Sources and Ingestion,” introduced the temporal concept, which focuses on data that is valid at a specific moment in time or within a given time frame. A temporal table, also referred to as a history table, can be seen in Figure 3.20. The meaning is the same in the context of a data stream processing solution; the primary difference is the length of the time frame. In the serving layer context, the time frame could be years, months, weeks, or days, whereas in the data stream processing context, seconds and perhaps minutes are used most often. Chapter 3 includes detailed coverage of the temporal windows available in Azure Stream Analytics. As you might recall, they are hopping, session, sliding, snapshot, and tumbling. The following code snippet is an example of a query that aggregates its results over a tumbling window:

SELECT
    System.TimeStamp() AS IngestionTime,
    PERCENTILE_CONT(0.5) OVER (ORDER BY brainwaves.ALPHA) AS medianALPHA,
    PERCENTILE_CONT(0.5) OVER (ORDER BY brainwaves.BETA_H) AS medianBETA_H,
    PERCENTILE_CONT(0.5) OVER (ORDER BY brainwaves.BETA_L) AS medianBETA_L,
    PERCENTILE_CONT(0.5) OVER (ORDER BY brainwaves.GAMMA) AS medianGAMMA,
    PERCENTILE_CONT(0.5) OVER (ORDER BY brainwaves.THETA) AS medianTHETA
INTO powerBI
FROM brainwaves
GROUP BY IngestionTime, TumblingWindow(second, 5)

Each brain wave reading sent to Azure Stream Analytics over a 5-second time frame is bundled together into a window. The data is received from the input named brainwaves, which you created and configured in Exercise 3.17. The SELECT statement, which calculates the median value per frequency, is then run against the data that was received in that 5-second window. The median values are sent to the output named powerBI for consumption and visualization.
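The windowing behavior described above can be sketched in plain Python. This is only an illustration of the idea, not how Azure Stream Analytics is implemented: the function name tumbling_medians, the sample readings, and the treatment of window boundaries (each window includes its start and excludes its end) are assumptions made for the example, and the boundary handling may differ from the service.

```python
from collections import defaultdict
from statistics import median


def tumbling_medians(readings, window_seconds=5):
    """Group (timestamp_seconds, value) pairs into fixed, non-overlapping
    windows and return {window_start: median_of_values_in_window}."""
    windows = defaultdict(list)
    for ts, value in readings:
        # Every reading falls into exactly one window, so windows never overlap.
        window_start = int(ts // window_seconds) * window_seconds
        windows[window_start].append(value)
    return {start: median(values) for start, values in sorted(windows.items())}


# Hypothetical ALPHA readings as (seconds since start, value) pairs.
sample = [(0.5, 4.0), (1.2, 6.0), (4.9, 5.0), (5.1, 10.0), (9.0, 20.0)]
result = tumbling_medians(sample)
print(result)
```

The first three readings land in the window starting at 0 seconds and the last two in the window starting at 5 seconds, so one median is produced per window, much as the query produces one row of medians per 5-second window.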

Data Format

As you saw in Table 7.1, the only streaming product that does not support all standard data formats is Azure Stream Analytics, which currently supports only UTF-8 encoded JSON and CSV, along with AVRO; Azure Stream Analytics does not support the Parquet, XML, and ORC file types. If your streaming solution requires those file types, then you need to choose another streaming product, such as Azure Databricks or HDInsight. Chapter 2, “CREATE DATABASE dbName; GO,” discussed the most common file types used for data analytics, and Chapter 3 provided additional information about the data formats and file types that apply specifically to designing an Azure Data Lake solution.
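As a rough illustration of two of the supported formats, the following sketch serializes one brain wave reading as UTF-8 encoded JSON and as UTF-8 encoded CSV using only the Python standard library. The field names follow the query in the “Temporal Windows” section; the values are made up for the example, and nothing here is specific to any Azure SDK.

```python
import csv
import io
import json

# Hypothetical reading; the keys match the columns used in the earlier query.
reading = {"ALPHA": 5.1, "BETA_H": 1.9, "BETA_L": 2.4, "GAMMA": 0.7, "THETA": 8.3}

# JSON: one self-describing document per event, encoded as UTF-8 bytes.
json_bytes = json.dumps(reading).encode("utf-8")

# CSV: a header row followed by one data row, also encoded as UTF-8 bytes.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(reading))
writer.writeheader()
writer.writerow(reading)
csv_bytes = buffer.getvalue().encode("utf-8")

print(json_bytes.decode("utf-8"))
print(csv_bytes.decode("utf-8"))
```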

Programming Paradigm

As shown in Table 7.1, Azure Stream Analytics supports only the declarative programming paradigm. The database querying language T-SQL is one of the most common declarative programming languages. Since the Azure Stream Analytics query language is a subset of T-SQL, it is clearly declarative. Two key differences between declarative and imperative programming have to do with control flow and idempotency. Control flow is what distinguishes declarative from imperative, in that control flow is not described in the declarative paradigm. That means that in the declarative programming paradigm, you define what you want the code to do, but you do not control exactly how it gets done. The example SQL statement in the “Temporal Windows” section illustrates this. The SQL statement is clear about what you want but does not dictate how to perform the computation. If the SQL statement ran procedurally, in the order it is written, it would run the INTO command before the FROM command, which would mean attempting to place data into the output before knowing where to get the input from. The order (the control flow) is not relevant in the declarative paradigm. Consider the following Python HDInsight 3.6 Storm code snippet, which is imperative:

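import storm
from collections import Counter

class CountBolt(storm.BasicBolt):  # wrapper class assumed; the excerpt showed only process()
    def initialize(self, conf, context):
        self._counter = Counter()  # running per-word totals kept across tuples

    def process(self, tup):
        words = tup.values[0].split()
        for word in words:
            storm.logInfo("Emitting %s" % word)
            self._counter[word] += 1  # update the stored total, then read it back
            count = self._counter[word]
            storm.emit([word, count])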

An imperative program comprises commands the computer will perform. The code snippet describes how the program operates, step-by-step. Objects, like words, are instantiated, and assignments are made to both words and word, which are further traits of the imperative programming paradigm.

The other difference between these two paradigms has to do with idempotency. Idempotency, in computer terms, means that a method or code snippet that is run multiple times with the same input returns the same expected result each time. An idempotent algorithm does not have to keep track of whether the operation has been triggered before. The Python code snippet provided earlier is not idempotent: the counter persists between invocations, so processing the same input again emits a higher count for each word, and the result of the method invocation is not the same when it is run multiple times.

If your data analytics require a lot of control over how the data should be processed, then you would need to choose a streaming product other than Azure Stream Analytics. Alternatively, if you need only to define what you want and let the underlying technology decide the best means of providing it, then Azure Stream Analytics or one of the other declarative streaming products available on Azure is a suitable choice.
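The idempotency point can be made concrete with a small sketch that needs no Storm cluster. The names CountingProcessor and count_words are hypothetical, and the Storm emit call is replaced by a plain return value; this is an illustration under those assumptions, not the code from the exercise.

```python
from collections import Counter


class CountingProcessor:
    """Not idempotent: keeps running totals between calls."""

    def __init__(self):
        self._counter = Counter()

    def process(self, sentence):
        emitted = []
        for word in sentence.split():
            self._counter[word] += 1  # state survives across invocations
            emitted.append((word, self._counter[word]))
        return emitted


def count_words(sentence):
    """Idempotent: the result depends only on the input sentence."""
    return sorted(Counter(sentence.split()).items())


processor = CountingProcessor()
first = processor.process("alpha beta alpha")
second = processor.process("alpha beta alpha")
print(first)
print(second)
print(count_words("alpha beta alpha"))
```

Calling process twice with the same sentence yields different counts because the stored totals keep growing, whereas count_words returns the same answer however many times it is called.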

DP-203 Azure Data Engineer

Katie Cox
