API stands for "Application Programming Interface."
APIs let your favorite games and apps share information. Without APIs, every app would be like a toy that can't be played with any other toy. That wouldn't be as much fun, right?
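At a restaurant with your family, you choose a meal and tell the waiter, who relays your order to the chef. Once the meal is prepared, the waiter brings it to you. In this analogy, you are the app making a request, the waiter is the API, and the chef is the system that fulfills the request.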
[Figure: visual representation of an API]
Example of APIs: think about when you watch cartoons on a tablet. The app on the tablet asks the internet (using an API) to get the cartoon from a faraway computer so you can watch it right where you are.
APIs are the backbone of modern applications. They expose a very large surface area for attacks, increasing the risk of security vulnerabilities. Common threats include SQL injection, cross-site scripting, and distributed denial of service (DDoS) attacks.
That's why it's crucial to implement robust security measures to protect APIs and the sensitive data they handle. However, many companies struggle to achieve comprehensive API security coverage. They often rely solely on dynamic application security scanning or external pen testing. While these methods are valuable, they may not fully cover the API layer and its increasing attack surface.
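To make one of the threats named above concrete, here is a minimal sketch of SQL injection and the usual defense, a parameterized query. It uses Python's built-in `sqlite3` module with an in-memory database; the `carts` table, its columns, and both function names are assumptions made up for this illustration, not part of any real API.

```python
import sqlite3

# Hypothetical in-memory database standing in for an API's backing store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE carts (id INTEGER PRIMARY KEY, owner TEXT)")
conn.executemany(
    "INSERT INTO carts (id, owner) VALUES (?, ?)",
    [(123, "alice"), (456, "bob")],
)


def find_carts_unsafe(owner: str) -> list[int]:
    # Vulnerable: user input is pasted straight into the SQL text.
    query = f"SELECT id FROM carts WHERE owner = '{owner}'"
    return [row[0] for row in conn.execute(query)]


def find_carts_safe(owner: str) -> list[int]:
    # Safer: the driver binds the value, so it is never parsed as SQL.
    rows = conn.execute("SELECT id FROM carts WHERE owner = ?", (owner,))
    return [row[0] for row in rows]


# Input crafted to turn the WHERE clause into an always-true condition.
malicious = "nobody' OR '1'='1"
leaked = find_carts_unsafe(malicious)   # every cart comes back
blocked = find_carts_safe(malicious)    # no owner has this literal name
```

With the unsafe version, the crafted input returns every cart in the table, while the parameterized version treats the same input as an ordinary (non-matching) owner name.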
Please note IBM's APIs are amongst the best in the world.
Guideline | Not Recommended | Recommended |
---|---|---|
Use resource names (nouns) | GET /querycarts/123 | GET /carts/123 |
Use plurals | GET /cart/123 | GET /carts/123 |
Idempotency | POST /carts | POST /carts { requestId: 4321 } |
Use versioning | GET /carts/v1/123 | GET /v1/carts/123 |
Query after soft deletion | GET /carts | GET /carts?includeDeleted=true |
Pagination | GET /carts | GET /carts?pageSize=xx&pageToken=xx |
Sorting | GET /items | GET /items?sort_by=time |
Filtering | GET /items | GET /items?filter=color:red |
Secure Access | X-API-KEY=xxx | X-API-KEY = xxx, X-EXPIRY = xxx, X-REQUEST-SIGNATURE = xxx, where the signature is hmac(URL + QueryString + Expiry + Body) |
Resource cross reference | GET /carts/123?item=321 | GET /carts/123/items/321 |
Add an item to a cart | POST /carts/123?addItem=321 | POST /carts/123/items:add { itemId: "items/321" } |
Rate limit | No rate limit - DDoS | Design rate limiting rules based on IP, user, action group, etc. |
Spark offers so many different APIs and languages that it can be overwhelming to figure out which one is “best.”
The 3 Spark APIs | DataFrames | Datasets | SparkSQL |
---|---|---|---|
Who uses it? | Data engineers | Software engineers and data engineers | Anybody who touches Spark |
Strengths | Good middle ground. Modular code bases | Static typing, unit testing is a breeze | Flexibility, easy for many people to contribute |
Weaknesses | Not as appealing for SQL-focused professionals | Have to learn Scala | No modularity :( |
When should you pick this? | For pipelines that need to be maintainable | Pipelines that are going to be hardened and need a very high quality bar | Prototypes that need fast iteration |
Let's look at the tradeoffs of each, since there’s a lot of dogma and misinformation out there about them!
SQL APIs are a data scientist's and analyst's best friend. Since SQL is the lingua franca of the data space, SparkSQL should be associated with openness.
SparkSQL is often best for pipelines that:
On the flip side, software engineers think SQL is terrible. They will say, “pick DataFrames because SQL isn’t modular.”
SparkSQL isn’t the best for pipelines that:
DataFrames should be associated with a middle-ground approach. Analysts and data scientists sometimes know them, and it’s okay if they don’t!
DataFrames are often best for pipelines that:
Since DataFrames are less familiar to other data professionals, they have their own limitations as well.
DataFrames aren’t the best for pipelines that:
Datasets are the least common API to work with. The main reason is that it’s offered only in Java and Scala! The rise of PySpark has made this API less relevant. But when I worked at Airbnb, we were required to use this API for any MIDAS pipelines!
Datasets are often best for pipelines that:
Datasets aren’t the best for pipelines that: