clib.conversion._to_numpy: Add tests for PyArrow's timestamp type #3621
    _HAS_PYARROW = True
except ImportError:

    class pa:  # noqa: N801
pa.timestamp() with tz doesn't have string aliases, so we can't use strings like timestamp[s, UTC]. So we have to define a dummy function here.
Could we avoid this dummy class by starting from a pandas.Timestamp instead of Python datetime? The timezone info can be carried over:
import pyarrow as pa
import pandas as pd
from zoneinfo import ZoneInfo
array = pa.array(
[
pd.Timestamp(2024, 1, 2, 3, 4, 5, tzinfo=ZoneInfo("America/New_York")),
pd.Timestamp(2024, 1, 2, 3, 4, 6, tzinfo=ZoneInfo("America/New_York")),
]
)
print(array.type)
# timestamp[us, tz=America/New_York]
We can, but I feel it will make the pytest parametrize more complicated.
Ok, let's keep this dummy class then.
        pytest.param(
            pa.timestamp("s", tz="America/New_York"),
            "datetime64[s]",
            id="timestamp[s, tz=America/New_York]",
        ),
        pytest.param(
            pa.timestamp("s", tz="+07:30"),
            "datetime64[s]",
            id="timestamp[s, tz=+07:30]",
        ),
    ],
)
def test_to_numpy_pyarrow_timestamp(dtype, expected_dtype):
    """
    Test the _to_numpy function with PyArrow arrays of PyArrow datetime types.

    pyarrow.timestamp(unit, tz=None) can accept units "s", "ms", "us", and "ns".

    Reference: https://arrow.apache.org/docs/python/generated/pyarrow.timestamp.html
    """
    data = [datetime(2024, 1, 2, 3, 4, 5), datetime(2024, 1, 2, 3, 4, 6)]
    array = pa.array(data, type=dtype)
Is the timezone part of the pa.timestamp type actually working?
data = [datetime(2024, 1, 2, 3, 4, 5), datetime(2024, 1, 2, 3, 4, 6)]
array = pa.array(data, type=pa.timestamp("s", tz="America/New_York"))
print(array)
# [
# 2024-01-02 03:04:05Z,
# 2024-01-02 03:04:06Z
# ]
result = _to_numpy(array)
print(result)
# ['2024-01-02T03:04:05' '2024-01-02T03:04:06']
If using pd.Timestamp instead:
import pandas as pd
import pyarrow as pa
from datetime import datetime, timezone
from zoneinfo import ZoneInfo
from pygmt.clib.conversion import _to_numpy
array = pa.array(
[
pd.Timestamp(2024, 1, 2, 3, 4, 5, tzinfo=ZoneInfo("America/New_York")),
pd.Timestamp(2024, 1, 2, 3, 4, 6, tzinfo=ZoneInfo("America/New_York")),
]
)
print(array)
# [
# 2024-01-02 08:04:05.000000Z,
# 2024-01-02 08:04:06.000000Z
# ]
result = _to_numpy(array)
print(result)
# ['2024-01-02T08:04:05.000000' '2024-01-02T08:04:06.000000']
For the first one, the timezone offset for New York (UTC-5) doesn't seem to be applied? Whereas for the second one, it is converting from New York timezone (UTC-5) to UTC by adding 5 hours.
It seems the timezone information is applied but not shown when representing the array.
In [1]: from datetime import datetime
In [2]: import pyarrow as pa
# For pa.scalar, it's shown with timezone offset
In [3]: pa.scalar(datetime(2024, 1, 2, 3, 4, 5), type=pa.timestamp("s", tz="America/New_York"))
Out[3]: <pyarrow.TimestampScalar: '2024-01-01T22:04:05-0500'>
In [4]: array = pa.array([datetime(2024, 1, 2, 3, 4, 5)], type=pa.timestamp("s", tz="America/New_York"))
# For pa.array, it's shown without timezone offset
In [5]: array
Out[5]:
<pyarrow.lib.TimestampArray object at 0x7f52feace740>
[
2024-01-02 03:04:05Z
]
# When converted to pandas, it's shown with timezone offset
In [6]: array.to_pandas()
Out[6]:
0 2024-01-01 22:04:05-05:00
dtype: datetime64[s, America/New_York]
# When converted to numpy, timezone information is lost, since numpy doesn't have tz support.
In [7]: array.to_numpy()
Out[7]: array(['2024-01-02T03:04:05'], dtype='datetime64[s]')
Also note the difference between the two inputs: the datetime objects in the first example are timezone-unaware (no tzinfo is set), while the pd.Timestamp objects in the second example carry a timezone.
So, in your first example:
- actual datetime: 2024-01-02T03:04:05+00:00
- In pyarrow.array, the actual datetime is 2024-01-01T22:04:05-05:00, but UTC time (2024-01-02T03:04:05) is shown
- to_numpy: 2024-01-02T03:04:05 (UTC time)
In your 2nd example:
- the actual datetime: 2024-01-02T03:04:05-05:00
- In pyarrow.array: the UTC time (2024-01-02 08:04:05) is shown
- to_numpy: 2024-01-02T08:04:05
So, I think there is no inconsistency. The pyarrow.array has the timezone stored internally, but its repr always shows UTC time. And when converting to a numpy array (using either its own to_numpy() method or np.ascontiguousarray()), the datetimes are always converted to UTC since numpy.datetime64 doesn't have timezone support.
https://stackoverflow.com/a/73276431
Arrow internally stores datetime as UTC + timezone info and will print it as such
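The two cases walked through above can be reproduced with the standard library alone. This is a minimal sketch of the behaviour reported in this thread, not of pyarrow's internals: `stored_utc` is a hypothetical helper, and a fixed UTC-5 offset stands in for America/New_York in January so that the example does not depend on the system timezone database.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC-5 offset used in place of ZoneInfo("America/New_York").
NEW_YORK_WINTER = timezone(timedelta(hours=-5))


def stored_utc(value):
    """Return the naive UTC time that ends up in the NumPy array (hypothetical helper)."""
    if value.tzinfo is None:
        # First example: a naive datetime is taken to be UTC already.
        return value
    # Second example: an aware datetime is shifted to UTC, then the tz is dropped.
    return value.astimezone(timezone.utc).replace(tzinfo=None)


naive = datetime(2024, 1, 2, 3, 4, 5)
aware = datetime(2024, 1, 2, 3, 4, 5, tzinfo=NEW_YORK_WINTER)

print(stored_utc(naive))  # 2024-01-02 03:04:05
print(stored_utc(aware))  # 2024-01-02 08:04:05
```

Both results match the `_to_numpy` outputs quoted earlier in the thread, which is why the two examples are consistent despite the confusing repr.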
Ok, I see now that there is indeed consistency, thanks for clearing this up, the repr on pyarrow.array is indeed confusing 😅 . So in summary:
| class/type | has timezone support | Link |
|---|---|---|
| Python datetime | ✅ | https://docs.python.org/3/library/datetime.html#datetime.datetime.tzinfo |
| pandas.Timestamp | ✅ | https://pandas.pydata.org/pandas-docs/version/2.2/reference/api/pandas.Timestamp.html |
| pyarrow.timestamp (type) | ✅ | https://arrow.apache.org/docs/17.0/python/generated/pyarrow.timestamp.html |
| numpy.datetime64 | ❌ | https://numpy.org/doc/2.1/reference/arrays.scalars.html#numpy.datetime64 |
Since we're converting to NumPy, the timezone information will always be lost (everything is converted to UTC). What does this mean for PyGMT/GMT? Does it mean that users cannot plot data in a specific timezone, because it will always convert to UTC?
As shown below, GMT itself doesn't have timezone support, similar to numpy.datetime64. The big difference is that GMT simply ignores the TZ info, while the to_numpy conversion always converts to UTC.
gmt begin map
gmt basemap -R2024-01-01T00:00:00+08:00/2024-01-02T00:00:00+08:00/1/10 -JX10c/4c -Baf
echo 2024-01-01T08:00:00+08:00 5 | gmt plot -Sc1c -Gblack
gmt end show
Does it mean that users cannot plot data in a specific timezone, because it will always convert to UTC?
I think yes, unless we try hard to drop the TZ information before converting to numpy.datetime64. Below is an example, but if we want to support TZ, we need to find a more general way to do it.
In [1]: import pyarrow as pa
In [2]: from datetime import datetime
In [3]: data = [datetime(2024, 1, 2, 3, 4, 5), datetime(2024, 1, 2, 3, 4, 6)]
In [4]: array = pa.array(data, type=pa.timestamp("s", tz="America/New_York"))
In [5]: array
Out[5]:
<pyarrow.lib.TimestampArray object at 0x7f78c07bafe0>
[
2024-01-02 03:04:05Z,
2024-01-02 03:04:06Z
]
In [6]: array.to_pandas().dt.tz_localize(None)
Out[6]:
0 2024-01-01 22:04:05
1 2024-01-01 22:04:06
dtype: datetime64[s]
In [7]: import numpy as np
In [8]: np.ascontiguousarray(array.to_pandas().dt.tz_localize(None))
Out[8]:
array(['2024-01-01T22:04:05', '2024-01-01T22:04:06'],
dtype='datetime64[s]')
Maybe we need to open an issue to discuss this, i.e. whether to:
1. Follow GMT (allow users to plot data at a non-UTC timezone, by ignoring the timezone offset)
2. Follow NumPy, whereby data from any non-UTC timezone will always be converted to UTC.
If going with 2, we should at least raise a warning if a non-UTC timezone is used, that a conversion is taking place. If going with 1, we would need to special-case datetime types, that might mean extra logic in the _to_numpy function, or having to keep array_to_datetime (so don't merge #3507 yet).
I'm also debating whether to follow option 1 or 2, so a separate issue sounds good.
Since the current behavior is option 2 (as done in array_to_datetime), I think we should go with option 2 first, then revisit the timezone support later, which will be a breaking change anyway.
Ok, I'll open an issue to discuss (edit: see #3656). And sure, we can go with option 2 for now.
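The two options weighed in this thread can be contrasted with a small standard-library sketch. `to_naive` and its `follow_gmt` flag are hypothetical names introduced only for illustration; nothing here is PyGMT's actual implementation, and the warning text is likewise an assumption.

```python
import warnings
from datetime import datetime, timedelta, timezone


def to_naive(value, follow_gmt=False):
    """Drop the timezone from a datetime using one of the two options (hypothetical)."""
    if value.tzinfo is None:
        return value  # already naive, nothing to drop
    if follow_gmt:
        # Option 1: ignore the offset and keep the local wall-clock time.
        return value.replace(tzinfo=None)
    # Option 2: convert to UTC, warning when a conversion actually takes place.
    if value.utcoffset() != timedelta(0):
        warnings.warn(
            f"Converting {value.isoformat()} to UTC; the timezone offset is not preserved.",
            RuntimeWarning,
            stacklevel=2,
        )
    return value.astimezone(timezone.utc).replace(tzinfo=None)


# Same instant as the GMT example above: 08:00 at UTC+8.
local = datetime(2024, 1, 1, 8, 0, 0, tzinfo=timezone(timedelta(hours=8)))

print(to_naive(local, follow_gmt=True))  # 2024-01-01 08:00:00
print(to_naive(local))                   # 2024-01-01 00:00:00 (emits a RuntimeWarning)
```

Option 1 would place the point at 08:00 on the plot axis, as GMT does, while option 2 places it at 00:00.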
weiji14 left a comment
Pre-approving, but see optional suggestions below.
Co-authored-by: Wei Ji <23487320+weiji14@users.noreply.github.com>

pyarrow.timestamp is the datetime dtype in pyarrow. It has two parameters, unit and tz.
This PR adds tests to ensure _to_numpy supports pyarrow.timestamp. Please note that unit can only be s/ms/us/ns. When converting to np.datetime64, the values are converted to UTC and the tz information itself is discarded.
Related to #3600.