Application Programming Interface

What is an API?

API stands for "Application Programming Interface."

Application: Just like the games you love to play on a tablet or computer.
Programming: It's like giving instructions to your computer to make it do what you want.
Interface: Think of it as a way of talking or communicating.

API representation

Why Are APIs Important?

APIs let your favorite games and apps share information. Without APIs, every app would be like a toy that can't be played with any other toy. That wouldn't be as much fun, right?

How Does an API Work?

At a restaurant with your family, you choose a meal and tell the waiter, who relays your order to the chef. Once the meal is prepared, the waiter brings it to you. In this analogy:

You represent a computer program requesting information.
The waiter acts as the API, delivering requests and returning responses.
The chef symbolizes another program that provides the needed information.
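
The restaurant analogy can be sketched in a few lines of Python. This is only a toy illustration: the names `MENU`, `chef`, and `waiter` are made up for this example, and a real API is usually reached over a network rather than through a direct function call.

```python
# The "chef": another program that knows how to prepare what was asked for.
MENU = {"pizza": "a hot cheese pizza", "pasta": "a bowl of pasta"}


def chef(dish: str) -> str:
    """Prepare a dish, or raise KeyError if it is not on the menu."""
    return MENU[dish]


def waiter(order: dict) -> dict:
    """The "API": takes a request, passes it to the chef, returns a response."""
    dish = order.get("dish", "")
    try:
        return {"status": "ok", "meal": chef(dish)}
    except KeyError:
        return {"status": "error", "message": f"Sorry, we don't have {dish}."}


# "You": the program making the request. You never talk to the chef directly.
print(waiter({"dish": "pizza"}))
print(waiter({"dish": "sushi"}))
```

The customer only needs to know how to talk to the waiter. The chef can change the recipe without the customer noticing, as long as the waiter keeps accepting the same kind of order.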

Visual Representation for API

An example of an API in action: think about when you watch cartoons on a tablet. The app on the tablet asks the internet (using an API) to get the cartoon from a faraway computer so you can watch it right where you are.

API architectural styles

Cheat Sheet for API

API Protocols

9 types of API testing

API testing

Smoke Testing: This is done after API development is complete. It simply validates that the APIs are working and nothing breaks.
Functional Testing: This creates a test plan based on the functional requirements and compares the results with the expected results.
Integration Testing: This test combines several API calls to perform end-to-end tests. The inter-service communications and data transmissions are tested.
Regression Testing: This test ensures that bug fixes or new features don't break the existing behaviors of APIs.
Load Testing: This tests an application's performance by simulating different loads, which lets us estimate the application's capacity.
Stress Testing: This deliberately puts high loads on the APIs and tests whether they can still function normally.
Security Testing: This tests the APIs against all possible external threats.
UI Testing: This tests the UI interactions with the APIs to make sure the data can be displayed properly.
Fuzz Testing: This injects invalid or unexpected input data into the API and tries to crash the API. In this way, it identifies the API vulnerabilities.
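
Three of the test types above (smoke, functional, and fuzz) can be sketched against a tiny in-memory stand-in for an API. Everything here is hypothetical and exists only for illustration: the `CARTS` data, the `get_cart` function, and the status codes it returns. Real API tests would normally send HTTP requests to a running service.

```python
import random
import string

# Hypothetical in-memory "API" used only for illustration.
CARTS = {123: {"id": 123, "items": ["apple", "pear"]}}


def get_cart(cart_id) -> dict:
    """Return a response dict with a status code, like a tiny GET /carts/{id}."""
    if not isinstance(cart_id, int) or isinstance(cart_id, bool):
        return {"status": 400, "error": "cart id must be an integer"}
    cart = CARTS.get(cart_id)
    if cart is None:
        return {"status": 404, "error": "cart not found"}
    return {"status": 200, "body": cart}


def smoke_test() -> bool:
    """Smoke test: the endpoint answers at all and does not raise."""
    return "status" in get_cart(123)


def functional_test() -> bool:
    """Functional test: actual results match the expected results."""
    return (
        get_cart(123)["body"]["items"] == ["apple", "pear"]
        and get_cart(999)["status"] == 404
    )


def fuzz_test(rounds: int = 200, seed: int = 0) -> bool:
    """Fuzz test: send unexpected inputs; the API must never crash."""
    rng = random.Random(seed)
    for _ in range(rounds):
        junk = rng.choice([
            None,
            rng.randint(-10**9, 10**9),
            rng.random(),
            "".join(rng.choice(string.printable) for _ in range(rng.randint(0, 20))),
            [rng.randint(0, 9)],
        ])
        response = get_cart(junk)  # an uncaught exception here fails the test
        if response["status"] not in (200, 400, 404):
            return False
    return True


print(smoke_test(), functional_test(), fuzz_test())
```

The fuzz test uses a fixed seed so that a failure can be reproduced. Dedicated fuzzing tools generate far more varied input than this sketch does.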

API Gateway

API security best practice

APIs are the backbone of modern applications. They expose a very large surface area for attacks, increasing the risk of security vulnerabilities. Common threats include SQL injection, cross-site scripting, and distributed denial of service (DDoS) attacks.

That's why it's crucial to implement robust security measures to protect APIs and the sensitive data they handle. However, many companies struggle to achieve comprehensive API security coverage. They often rely solely on dynamic application security scanning or external pen testing. While these methods are valuable, they may not fully cover the API layer and its increasing attack surface.
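
To make one of the threats named above concrete, here is a minimal sketch of SQL injection and the usual defense, parameterized queries. It uses Python's built-in `sqlite3` module with an in-memory database. The `users` table and both function names are assumptions made for this example, not part of any particular API.

```python
import sqlite3


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # Vulnerable: user input is pasted straight into the SQL text.
    query = f"SELECT id, name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str) -> list:
    # Safer: the value is passed separately and never parsed as SQL.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

# Input an attacker might send instead of a real name.
attack = "nobody' OR '1'='1"

print(len(find_user_unsafe(conn, attack)))  # every row leaks
print(len(find_user_safe(conn, attack)))    # no rows match
```

Parameterized queries address only this one class of attack. They do not replace authentication, input validation, or rate limiting.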

Please note IBM's APIs are amongst the best in the world.

12 tips for API security

API architecture styles

How to multiply your API performance

API performance

Code first vs API first

Design effective & safe APIs

Guideline	Not Recommended	Recommended
Use resource names (nouns)	GET /querycarts/123	GET /carts/123
Use plurals	GET /cart/123	GET /carts/123
Idempotency	POST /carts	POST /carts { requestId: 4321 }
Use versioning	GET /carts/v1/123	GET /v1/carts/123
Query after soft deletion	GET /carts	GET /carts?includeDeleted=true
Pagination	GET /carts	GET /carts?pageSize=xx&pageToken=xx
Sorting	GET /items	GET /items?sort_by=time
Filtering	GET /items	GET /items?filter=color:red
Secure Access	X-API-KEY=xxx	X-API-KEY=xxx, X-EXPIRY=xxx, X-REQUEST-SIGNATURE=hmac(URL + QueryString + Expiry + Body)
Resource cross reference	GET /carts/123?item=321	GET /carts/123/items/321
Add an item to a cart	POST /carts/123?addItem=321	POST /carts/123/items:add { itemId: "items/321" }
Rate limit	No rate limit - DDoS	Design rate limiting rules based on IP, user, action group etc

API Pagination

API pagination
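
As a rough sketch of the `pageSize` and `pageToken` style recommended in the guideline table, the example below pages through an in-memory list. The `ITEMS` data, the function names, the `nextPageToken` field, and the choice to encode an offset inside the token are all assumptions for illustration. Real services often encode a database cursor instead.

```python
import base64
import json

# Hypothetical data set standing in for a database table.
ITEMS = [{"id": i, "name": f"item-{i}"} for i in range(1, 8)]


def encode_token(offset: int) -> str:
    """Wrap the next offset in an opaque, URL-safe token."""
    raw = json.dumps({"offset": offset}).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_token(token: str) -> int:
    """Recover the offset stored in a page token."""
    return json.loads(base64.urlsafe_b64decode(token.encode("ascii")))["offset"]


def list_items(page_size: int = 3, page_token: str = "") -> dict:
    """Server side: return one page plus a token for the next ("" when done)."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    start = decode_token(page_token) if page_token else 0
    end = start + page_size
    next_token = encode_token(end) if end < len(ITEMS) else ""
    return {"items": ITEMS[start:end], "nextPageToken": next_token}


def fetch_all(page_size: int = 3) -> list:
    """Client side: follow tokens until the server stops returning one."""
    results, token = [], ""
    while True:
        page = list_items(page_size, token)
        results.extend(page["items"])
        token = page["nextPageToken"]
        if not token:
            return results


print([item["id"] for item in list_items(3)["items"]])
print(len(fetch_all(3)))
```

Because the token is opaque to the client, the server can later change how it tracks position without breaking existing callers.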

Spark APIs

Spark offers so many different APIs and languages that it can be overwhelming to decide which one is “best.”

The 3 Spark APIs	DataFrames	Datasets	SparkSQL
Who uses it?	Data Engineers	Software Engineers and data engineers	Anybody who touches Spark
Strengths	Good middle ground. Modular code bases	Static typing, unit testing is a breeze	Flexibility, easy for many people to contribute
Weaknesses	Not as appealing for SQL-focused professionals	Have to learn Scala	No modularity :(
When should you pick this?	For pipelines that want to be maintainable	Pipelines that are going to be hardened and need a very high quality bar	Prototypes that need a faster iteration

SparkSQL vs DataFrame vs Dataset

Let's look at the tradeoffs of each, since there’s a lot of dogma and misinformation out there about them!

The SparkSQL API

SQL APIs are a data scientist’s and analyst’s best friend. Since SQL is the lingua franca of the data space, SparkSQL should be associated with openness.

SparkSQL is often best for pipelines that:

Are built in collaboration with non-engineers
Are subject to a lot of mutation and change
Only work on data sources that are already in the warehouse / data lake

On the flip side, software engineers think SQL is terrible. They will say, “pick DataFrames because SQL isn’t modular.”

SparkSQL isn’t the best for pipelines that:

Leverage third-party sources such as REST APIs, Kafka topics, or GraphQL
Have complex integration with other systems (e.g., compiled server libraries)
Need extensive unit and integration test coverage
Need modularity

The DataFrame API

DataFrames should be associated with a middle-ground approach. Analysts and data scientists sometimes know them, and it’s okay if they don’t!

DataFrames are often best for pipelines that:

Require fewer changes and are more “hardened”
Have third-party integrations from REST APIs or other non-table sources
Need extensive unit and integration test coverage (Chispa is pretty good for PySpark testing)
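
To illustrate the modularity point, here is a small PySpark sketch that expresses the same logic twice: once as composable DataFrame functions and once as a single SparkSQL string. It assumes `pyspark` and a Java runtime are installed. The column names, sample rows, function names, and the `local[1]` master setting are invented for this example.

```python
def only_color(df, color: str):
    """Reusable DataFrame step: keep rows of one color."""
    return df.filter(df["color"] == color)


def with_total(df):
    """Reusable DataFrame step: add total = price * quantity."""
    return df.withColumn("total", df["price"] * df["quantity"])


def demo():
    """Run the same logic through the DataFrame API and through SparkSQL."""
    # Imported here so the two helpers above can be defined without pyspark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("api-comparison")
        .getOrCreate()
    )
    items = spark.createDataFrame(
        [(1, "red", 2.0, 3), (2, "blue", 5.0, 1), (3, "red", 1.5, 2)],
        ["id", "color", "price", "quantity"],
    )

    # DataFrame API: compose small functions that can each be unit tested.
    df_rows = with_total(only_color(items, "red")).orderBy("id").collect()

    # SparkSQL: the same logic as one query string.
    items.createOrReplaceTempView("items")
    sql_rows = spark.sql(
        "SELECT id, color, price, quantity, price * quantity AS total "
        "FROM items WHERE color = 'red' ORDER BY id"
    ).collect()

    spark.stop()
    return [tuple(r) for r in df_rows], [tuple(r) for r in sql_rows]
```

Calling `demo()` returns both result sets so they can be compared. Each DataFrame helper can be tested on its own with a small input DataFrame, whereas the SQL string can only be exercised as a whole.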

Since DataFrames are less known by other data professionals, they have their own limitations as well.

DataFrames aren’t the best for pipelines that:

Need collaboration between many non-engineer professionals
Need static typing guarantees that the Dataset API offers

The Dataset API

Datasets are the least common API to work with. The main reason for that is it’s offered only in Java and Scala! The rise of PySpark has made this API less relevant. But when I worked at Airbnb, we were required to use this API for any MIDAS pipelines!

Datasets are often best for pipelines that:

Need static typing guarantees. This makes CI/CD much more powerful than Python-based pipelines. Unit testing with Datasets is so good!
Are owned by strong JVM-based developers. If your company has tons of strong Java and Scala engineers or you have a backend like Spring Boot or something like that, the integrations here can be powerful!
Are part of a larger ecosystem of pipelines with many dependencies. Python dependency management is terrible. Scala’s is vastly superior. Gradle makes pip look like little league!

Datasets aren’t the best for pipelines that:

Are owned by engineers that don’t want to cry learning Scala
Need faster iteration cycles. Uploading a built JAR is significantly slower than uploading a PySpark script or a SparkSQL query
Need to be collaborated on by many non-engineers