This was a fun afternoon reverse engineering project so I figured I’d write a bit about it.
I’m developing a web app, Pumbaa Backtester, which is a small tool to simulate the historical performance of index-based investments. As part of the development I wanted to fetch long-term historical data for an ETF traded at a medium-size stock exchange. I won’t write exactly which one, but if you are curious you’ll figure it out.
Each day a closing price for the ETF is determined, which is pretty much the price of a stock at the end of the trading day. What I needed were the closing prices going back to the ETF's creation 22 years ago. Browsing the stock exchange's site a bit, I got to the following form:
Great! This gives the data I want. The goal is to automate fetching these prices - I want it to be done automatically once a day. So let’s fire up Firefox network monitor (Ctrl+Shift+E) and see what happens when we press “Search”:
Looks simple enough - an API with the parameters:

- `isin` (the unique ID of the ETF),
If we attempt to access the API with curl:
we get back an empty JSON response. At this point the most likely explanation is that we are missing one of the HTTP headers - it could be a Cookie header or something else. Looking at the original request's headers, everything is quite standard except for the trio
These are not documented on MDN, so they must be something unique to this API. You might wonder whether we could just replay the exact same headers - and yes, that works for a few minutes, but then it stops. We'll have to figure out the logic behind them.
`Client-Date` is simple enough - it's just the current time. The other two are 16-byte, hex-encoded strings, so maybe they are just random UUIDs? Let's try:
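A quick way to test the random-UUID theory is to generate two random 16-byte hex strings and send them along. (The post doesn't name the second header, so `X-Other-Id` below is a placeholder for whatever name you see in the network monitor.)

```python
# If the two headers were just random UUIDs, any random 16-byte
# hex string should be accepted by the server.
import uuid

def random_hex_headers() -> dict:
    return {
        "X-Client-TraceId": uuid.uuid4().hex,  # 32 hex chars = 16 bytes
        "X-Other-Id": uuid.uuid4().hex,        # placeholder header name
    }

print(random_hex_headers())
```

Sending these, the response is still empty - the values are not random, which is why the next step is digging into the page's scripts.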
Finding the origin
Searching for the string `X-Client-TraceId` through the JS scripts that the page uses, we find the culprit:
At first I tried to approach this like a programmer, understanding where each variable comes from. But in a 90k-line script where everything is called `r`, that's quite impossible. It doesn't help that the surrounding code looks like some form of alien code:
So let’s just use some common sense and go header-by-header:
Here is the snippet again for convenience:
We can guess that it's a hash of the current time, after being converted to the format `YYYYMMDDHHmm`. Which hash? The result is 16 bytes long, so the most probable candidate is MD5. Let's check:
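A minimal check, hashing the current UTC time in that format:

```python
# MD5 of the current UTC time, formatted as YYYYMMDDHHmm
import hashlib
from datetime import datetime, timezone

t = datetime.now(timezone.utc).strftime("%Y%m%d%H%M")
print(t, "->", hashlib.md5(t.encode()).hexdigest())
```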
Doesn’t match… maybe we need to use the local time instead?
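The same check again, but with the local time:

```python
# MD5 of the current *local* time, formatted as YYYYMMDDHHmm
import hashlib
from datetime import datetime

t = datetime.now().strftime("%Y%m%d%H%M")
print(t, "->", hashlib.md5(t.encode()).hexdigest())
```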
It matches! So we got this header as well.
Here’s the relevant part again:
Leveraging what we found out already, this header is generated as follows:
- The current time, `e`, is concatenated with two unknown strings,
- `X-Client-TraceId` is the MD5 hash of the result.
The Firefox debugger (Ctrl+Shift+Z) lets us beautify the script and put a breakpoint on this line. Hitting “Search” again the breakpoint is triggered, and we can see the variables’ values:
- `t` is the requested URL, including the query string.
- `salt` is a fixed string, in this case `w4icATTGtnjAZMbkL3kJwxMfEAKDa3MN`. It appears in the source code as-is, so it must be constant.
- `X-Client-TraceId` is the MD5 of `time + url + salt`.
Now we have all the information needed to generate valid requests to the API:
- Take the current time and hash it to generate the time-based header,
- Construct the URL with its parameters, append it to the time along with the salt, and hash everything together to generate `X-Client-TraceId`.
And it works! Here is a Python snippet to generate the headers for a given URL:
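A sketch of that snippet, based on the findings above. Two things here are assumptions: the post never names the time-hash header, so `X-Time-Hash` is a placeholder for whatever name appears in the network monitor, and the `Client-Date` format is guessed as ISO 8601.

```python
# Generate the three custom headers for a given request URL.
import hashlib
from datetime import datetime

# The constant found as-is in the page's JS
SALT = "w4icATTGtnjAZMbkL3kJwxMfEAKDa3MN"

def generate_headers(url: str) -> dict:
    now = datetime.now()             # local time, as we found out above
    t = now.strftime("%Y%m%d%H%M")   # the YYYYMMDDHHmm format
    return {
        "Client-Date": now.isoformat(),  # assumption: ISO 8601 timestamp
        # placeholder name: MD5 of the formatted local time
        "X-Time-Hash": hashlib.md5(t.encode()).hexdigest(),
        # MD5 of time + url + salt
        "X-Client-TraceId": hashlib.md5(
            (t + url + SALT).encode()
        ).hexdigest(),
    }
```

Attach the result to the request (e.g. `requests.get(url, headers=generate_headers(url))`) and the API answers with the actual price data.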
What was the purpose of these headers? I'm really not sure - it could be protection against bots, or maybe a user-tracking mechanism. Either way, it didn't take much work to understand. I guess if you expose your API on the internet, you should expect someone to figure it out and use it.